Synthetic nucleic acid molecule compositions and methods of preparation

ABSTRACT

A method to prepare synthetic nucleic acid molecules having reduced inappropriate or unintended transcriptional characteristics when expressed in a particular host cell.

STATEMENT OF GOVERNMENT RIGHTS

The invention was made at least in part with a grant from the Government of the United States of America (grant DMI-9402762 from the National Science Foundation). The Government has certain rights to the invention.

BACKGROUND OF THE INVENTION

Transcription, the synthesis of an RNA molecule from a sequence of DNA, is the first step in gene expression. Sequences which regulate DNA transcription include promoter sequences, polyadenylation signals, transcription factor binding sites and enhancer elements. A promoter is a DNA sequence capable of specific initiation of transcription and consists of three general regions. The core promoter is the sequence where the RNA polymerase and its cofactors bind to the DNA. Immediately upstream of the core promoter is the proximal promoter, which contains several transcription factor binding sites that are responsible for the assembly of an activation complex that in turn recruits the polymerase complex. The distal promoter, located further upstream of the proximal promoter, also contains transcription factor binding sites. Transcription termination and polyadenylation, like transcription initiation, are site specific and encoded by defined sequences. Enhancers are regulatory regions, containing multiple transcription factor binding sites, that can significantly increase the level of transcription from a responsive promoter regardless of the enhancer's orientation and distance with respect to the promoter, as long as the enhancer and promoter are located within the same DNA molecule. The amount of transcript produced from a gene may also be regulated by a post-transcriptional mechanism, the most important being RNA splicing, which removes intervening sequences (introns) from a primary transcript between splice donor and splice acceptor sequences.

Natural selection is the hypothesis that genotype-environment interactions occurring at the phenotypic level lead to differential reproductive success of individuals and therefore to modification of the gene pool of a population. Some properties of nucleic acid molecules that are acted upon by natural selection include codon usage frequency, RNA secondary structure, the efficiency of intron splicing, and interactions with transcription factors or other nucleic acid binding proteins. Because of the degenerate nature of the genetic code, these properties can be optimized by natural selection without altering the corresponding amino acid sequence.

Under some conditions, it is useful to synthetically alter the natural nucleotide sequence encoding a polypeptide to better adapt the polypeptide for alternative applications. A common example is to alter the codon usage frequency of a gene when it is expressed in a foreign host cell. Although redundancy in the genetic code allows amino acids to be encoded by multiple codons, different organisms favor some codons over others. It has been found that the efficiency of protein translation in a non-native host cell can be substantially increased by adjusting the codon usage frequency but maintaining the same gene product (U.S. Pat. Nos. 5,096,825, 5,670,356, and 5,874,304).

However, altering codon usage may, in turn, result in the unintentional introduction into a synthetic nucleic acid molecule of inappropriate transcription regulatory sequences. This may adversely affect transcription, resulting in anomalous expression of the synthetic DNA. Anomalous expression is defined as departure from normal or expected levels of expression. For example, transcription factor binding sites located downstream from a promoter have been demonstrated to affect promoter activity (Michael et al., 1990; Lamb et al., 1998; Johnson et al., 1998; Jones et al., 1997). Additionally, it is not uncommon for an enhancer element to exert activity and result in elevated levels of DNA transcription in the absence of a promoter sequence or for the presence of transcription regulatory sequences to increase the basal levels of gene expression in the absence of a promoter sequence.

Thus, what is needed is a method for making synthetic nucleic acid molecules with altered codon usage without also introducing inappropriate or unintended transcription regulatory sequences for expression in a particular host cell.

SUMMARY OF THE INVENTION

The invention provides a synthetic nucleic acid molecule comprising at least 300 nucleotides of a coding region for a polypeptide, having a codon composition differing at more than 25% of the codons from a wild type nucleic acid sequence encoding a polypeptide, and having at least 3-fold fewer, preferably at least 5-fold fewer, transcription regulatory sequences than would result if the differing codons were randomly selected. Preferably, the synthetic nucleic acid molecule encodes a polypeptide that has an amino acid sequence that is at least 85%, preferably 90%, and most preferably 95% or 99% identical to the amino acid sequence of the naturally-occurring (native or wild type) polypeptide (protein) from which it is derived. Thus, it is recognized that some specific amino acid changes may also be desirable to alter a particular phenotypic characteristic of the polypeptide encoded by the synthetic nucleic acid molecule. Preferably, the amino acid sequence identity is over at least 100 contiguous amino acid residues. In one embodiment of the invention, the codons in the synthetic nucleic acid molecule that differ preferably encode the same amino acids as the corresponding codons in the wild type nucleic acid sequence.
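The two numerical criteria in the preceding paragraph (more than 25% of codons differing, and at least 3-fold fewer transcription regulatory sequences) can be illustrated with a minimal Python sketch. The function names, the in-frame equal-length comparison, and the way the regulatory sequence counts are supplied are assumptions made for illustration and are not taken from the specification.

```python
def split_codons(seq):
    """Split an in-frame coding sequence into a list of three-base codons."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def percent_codons_differing(wild_type, synthetic):
    """Percentage of codon positions at which two aligned sequences differ."""
    wt = split_codons(wild_type)
    syn = split_codons(synthetic)
    if len(wt) != len(syn):
        raise ValueError("sequences must contain the same number of codons")
    differing = sum(1 for a, b in zip(wt, syn) if a != b)
    return 100.0 * differing / len(wt)


def fold_fewer(count_reference, count_synthetic):
    """Fold reduction in regulatory sequence count (assumed supplied by the caller)."""
    if count_synthetic == 0:
        return float("inf")
    return count_reference / count_synthetic


# Toy example: 2 of 3 codons differ, and hypothetical counts of 12 versus 3 sites.
pct = percent_codons_differing("ATGCTTTCA", "ATGCTGAGC")
meets_criteria = pct > 25.0 and fold_fewer(12, 3) >= 3.0
```

In practice the reference count would come from a sequence in which the differing codons were chosen at random, as the paragraph describes; the counts 12 and 3 here are placeholders only.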

The transcription regulatory sequences which are reduced in the synthetic nucleic acid molecule include, but are not limited to, any combination of transcription factor binding sequences, intron splice sites, poly(A) addition sites, enhancer sequences and promoter sequences. Transcription regulatory sequences are well known in the art.

It is preferred that the synthetic nucleic acid molecule of the invention has a codon composition that differs from that of the wild type nucleic acid sequence at more than 30%, 35%, 40% or more than 45%, e.g., 50%, 55%, 60% or more of the codons. Preferred codons for use in the invention are those which are employed more frequently than at least one other codon for the same amino acid in a particular organism and, more preferably, are also not low-usage codons in that organism and are not low-usage codons in the organism used to clone or screen for the expression of the synthetic nucleic acid molecule (for example, E. coli). Moreover, preferred codons for certain amino acids (i.e., those amino acids that have three or more codons) may include two or more codons that are employed more frequently than the other (non-preferred) codon(s). The presence of codons in the synthetic nucleic acid molecule that are employed more frequently in one organism than in another organism results in a synthetic nucleic acid molecule which, when introduced into the cells of the organism that employs those codons more frequently, is expressed in those cells at a level that is greater than the expression of the wild type or parent nucleic acid sequence in those cells. For example, the synthetic nucleic acid molecule of the invention is expressed at a level that is at least about 110%, e.g., 150%, 200%, 500% or more (1000%, 5000%, or 10000%) of that of the wild type nucleic acid sequence in a cell or cell extract under identical conditions (such as cell culture conditions, vector backbone, and the like).

In one embodiment of the invention, the codons that are different are those employed more frequently in a mammal, while in another embodiment the codons that are different are those employed more frequently in a plant. A particular type of mammal, e.g., human, may have a different set of preferred codons than another type of mammal. Likewise, a particular type of plant may have a different set of preferred codons than another type of plant. In one embodiment of the invention, the majority of the codons which differ are ones that are preferred codons in a desired host cell. Preferred codons for mammals (e.g., humans) and plants are known to the art (e.g., Wada et al., 1990). For example, preferred human codons include, but are not limited to, CGC (Arg), CTG (Leu), TCT (Ser), AGC (Ser), ACC (Thr), CCA (Pro), CCT (Pro), GCC (Ala), GGC (Gly), GTG (Val), ATC (Ile), ATT (Ile), AAG (Lys), AAC (Asn), CAG (Gln), CAC (His), GAG (Glu), GAC (Asp), TAC (Tyr), TGC (Cys) and TTC (Phe) (Wada et al., 1990). Thus, preferred “humanized” synthetic nucleic acid molecules of the invention have a codon composition which differs from a wild type nucleic acid sequence by having an increased number of the preferred human codons, e.g., CGC, CTG, TCT, AGC, ACC, CCA, CCT, GCC, GGC, GTG, ATC, ATT, AAG, AAC, CAG, CAC, GAG, GAC, TAC, TGC, TTC, or any combination thereof. For example, the synthetic nucleic acid molecule of the invention may have an increased number of CTG or TTG leucine-encoding codons, GTG or GTC valine-encoding codons, GGC or GGT glycine-encoding codons, ATC or ATT isoleucine-encoding codons, CCA or CCT proline-encoding codons, CGC or CGT arginine-encoding codons, AGC or TCT serine-encoding codons, ACC or ACT threonine-encoding codons, GCC or GCT alanine-encoding codons, or any combination thereof, relative to the wild type nucleic acid sequence.
Similarly, synthetic nucleic acid molecules having an increased number of codons that are employed more frequently in plants have a codon composition which differs from a wild type or parent nucleic acid sequence by having an increased number of the plant codons including, but not limited to, CGC (Arg), CTT (Leu), TCT (Ser), TCC (Ser), ACC (Thr), CCA (Pro), CCT (Pro), GCT (Ala), GGA (Gly), GTG (Val), ATC (Ile), ATT (Ile), AAG (Lys), AAC (Asn), CAA (Gln), CAC (His), GAG (Glu), GAC (Asp), TAC (Tyr), TGC (Cys), TTC (Phe), or any combination thereof (Murray et al., 1989). Preferred codons may differ for different types of plants (Wada et al., 1990).
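As a rough illustration of replacing non-preferred codons with the preferred human codons listed above while keeping the encoded amino acids unchanged, consider the following Python sketch. The helper names (`humanize`, `translate`) are hypothetical, and always choosing the first listed preferred codon is a simplifying assumption; the specification notes that other considerations, such as avoiding transcription regulatory sequences, may favor a different choice.

```python
# Standard genetic code, built in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Preferred human codons as listed in the text, keyed by one-letter amino acid code.
PREFERRED_HUMAN = {
    "R": ["CGC"], "L": ["CTG"], "S": ["TCT", "AGC"], "T": ["ACC"],
    "P": ["CCA", "CCT"], "A": ["GCC"], "G": ["GGC"], "V": ["GTG"],
    "I": ["ATC", "ATT"], "K": ["AAG"], "N": ["AAC"], "Q": ["CAG"],
    "H": ["CAC"], "E": ["GAG"], "D": ["GAC"], "Y": ["TAC"],
    "C": ["TGC"], "F": ["TTC"],
}


def translate(cds):
    """Translate an in-frame sequence; stop codons appear as '*'."""
    return "".join(GENETIC_CODE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def humanize(cds):
    """Swap each non-preferred codon for the first listed preferred codon (an assumption)."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    out = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        options = PREFERRED_HUMAN.get(GENETIC_CODE[codon])
        if options and codon not in options:
            codon = options[0]
        out.append(codon)
    return "".join(out)


original = "ATGTTATCGAAATAA"      # Met-Leu-Ser-Lys-stop, with non-preferred codons
humanized = humanize(original)    # ATG CTG TCT AAG TAA
```

Codons for amino acids that have no entry in the list (here Met, Trp and the stop codons) are left untouched, so the encoded polypeptide is unchanged.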

The choice of codon may be influenced by many factors such as, for example, the desire to have an increased number of nucleotide substitutions or decreased number of transcription regulatory sequences. Under some circumstances (e.g., to permit removal of a transcription factor binding site) it may be desirable to replace a non-preferred codon with a codon other than a preferred codon or a codon other than the most preferred codon. Under other circumstances, for example, to prepare codon distinct versions of a synthetic nucleic acid molecule, preferred codon pairs are selected based upon the largest number of mismatched bases, as well as the criteria described above.
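Selecting a codon pair by the largest number of mismatched bases can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the candidate list for serine adds TCC to the AGC and TCT codons named above purely to give the comparison more than one option.

```python
from itertools import combinations


def mismatches(codon_a, codon_b):
    """Number of positions at which two codons differ."""
    return sum(1 for x, y in zip(codon_a, codon_b) if x != y)


def most_distinct_pair(codons):
    """Return the pair of synonymous codons with the most mismatched bases."""
    if len(codons) < 2:
        return None
    return max(combinations(codons, 2), key=lambda pair: mismatches(*pair))


# Hypothetical serine candidates: TCT/AGC differ at all three positions.
serine_pair = most_distinct_pair(["TCT", "AGC", "TCC"])
# Leucine codons named in the text differ at only one position.
leucine_mismatches = mismatches("CTG", "TTG")
```

Assigning one member of each pair to the first synthetic molecule and the other member to the second maximizes nucleotide differences at that amino acid position without changing the encoded protein.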

The presence of codons in the synthetic nucleic acid molecule that are employed more frequently in one organism than in another organism results in a synthetic nucleic acid molecule which, when introduced into a cell of the organism that employs those codons, is expressed in that cell at a level which is greater than the level of expression of the wild type or parent nucleic acid sequence.

A synthetic nucleic acid molecule of the invention may encode a selectable marker protein or a reporter molecule. However, the invention applies to any gene and is not limited to synthetic reporter genes or synthetic selectable marker genes. In one embodiment of a synthetic nucleic acid molecule of the invention that is a reporter molecule, the synthetic nucleic acid molecule encodes a luciferase having a codon composition different than that of a wild type or parent Renilla luciferase or a beetle luciferase nucleic acid sequence. A synthetic click beetle luciferase nucleic acid molecule of the invention may optionally encode the amino acid valine at position 224 (i.e., it emits green light), or may optionally encode the amino acid histidine at position 224, histidine at position 247, isoleucine at position 346, glutamine at position 348 or a combination thereof (i.e., it emits red light). Preferred synthetic luciferase nucleic acid molecules that are related to a wild type Renilla luciferase nucleic acid sequence include, but are not limited to, SEQ ID NO:21 (Rlucver2) or SEQ ID NO:22 (Rluc-final). Preferred synthetic luciferase nucleic acid molecules that are related to click beetle luciferase nucleic acid sequences include, but are not limited to, SEQ ID NO:7 (GRver5), SEQ ID NO:8 (GR6), SEQ ID NO:9 (GRver5.1), SEQ ID NO:14 (RDver5), SEQ ID NO:15 (RD7), SEQ ID NO:16 (RDver5.1), SEQ ID NO:17 (RDver5.2) or SEQ ID NO:18 (RD156-1H9).

The invention also provides an expression cassette. The expression cassette of the invention comprises a synthetic nucleic acid molecule of the invention operatively linked to a promoter that is functional in a cell. Preferred promoters are those functional in mammalian cells and those functional in plant cells. Optionally, the expression cassette may include other sequences, e.g., restriction enzyme recognition sequences and a Kozak sequence, and be a part of a larger polynucleotide molecule such as a plasmid, cosmid, artificial chromosome or vector, e.g., a viral vector.

Also provided is a host cell comprising the synthetic nucleic acid molecule of the invention, an isolated polypeptide (e.g., a fusion polypeptide encoded by the synthetic nucleic acid molecule of the invention), and compositions and kits comprising the synthetic nucleic acid molecule of the invention or the polypeptide encoded thereby in suitable container means and, optionally, instruction means. Preferred isolated polypeptides include, but are not limited to, those comprising SEQ ID NO:31 (GRver5.1), SEQ ID NO:226 (Rluc-final), or SEQ ID NO:223 (RD156-1H9).

The invention also provides a method to prepare a synthetic nucleic acid molecule of the invention by genetically altering a parent (either a wild type or another synthetic) nucleic acid sequence. The method may be used to prepare a synthetic nucleic acid molecule encoding a polypeptide comprising at least 100 amino acids. One embodiment of the invention is directed to the preparation of synthetic genes encoding reporter or selectable marker proteins. The method of the invention may be employed to alter the codon usage frequency and decrease the number of transcription regulatory sequences in any open reading frame or to decrease the number of transcription regulatory sites in a vector backbone. Preferably, the codon usage frequency in the synthetic nucleic acid molecule is altered to reflect that of the host organism desired for expression of that nucleic acid molecule while also decreasing the number of potential transcription regulatory sequences relative to the parent nucleic acid molecule.

Thus, the invention provides a method to prepare a synthetic nucleic acid molecule comprising an open reading frame. The method comprises altering (e.g., decreasing or eliminating) a plurality of transcription regulatory sequences in a parent (wild type or a synthetic) nucleic acid sequence that encodes a polypeptide having at least 100 amino acids to yield a synthetic nucleic acid molecule which has a decreased number of transcription regulatory sequences and which preferably encodes the same amino acids as the parent nucleic acid molecule. The transcription regulatory sequences are selected from the group consisting of transcription factor binding sequences, intron splice sites, poly(A) addition sites, enhancer sequences and promoter sequences, and the resulting synthetic nucleic acid molecule has at least 3-fold fewer, preferably 5-fold fewer, transcription regulatory sequences relative to the parent nucleic acid sequence. The method also comprises altering greater than 25% of the codons in the synthetic nucleic acid sequence which has a decreased number of transcription regulatory sequences to yield a further synthetic nucleic acid molecule, wherein the codons that are altered encode the same amino acids as those in the corresponding position in the synthetic nucleic acid molecule which has a decreased number of transcription regulatory sequences and/or in the parent nucleic acid sequence. Preferably, the codons which are altered do not result in an increase in transcriptional regulatory sequences. Preferably, the further synthetic nucleic acid molecule encodes a polypeptide that has at least 85%, preferably 90%, and most preferably 95% or 99% contiguous amino acid sequence identity to the amino acid sequence of the polypeptide encoded by the parent nucleic acid sequence.

Alternatively, the method comprises altering greater than 25% of the codons in a parent nucleic acid sequence which encodes a polypeptide having at least 100 amino acids to yield a codon-altered synthetic nucleic acid molecule, wherein the codons that are altered encode the same amino acids as those present in the corresponding positions in the parent nucleic acid sequence. Then, a plurality of transcription regulatory sequences in the codon-altered synthetic nucleic acid molecule are altered to yield a further synthetic nucleic acid molecule. Preferably, the codons which are altered do not result in an increase in transcriptional regulatory sequences. Also, preferably, the further synthetic nucleic acid molecule encodes a polypeptide that has at least 85%, preferably 90%, and most preferably 95% or 99% contiguous amino acid sequence identity to the amino acid sequence of the polypeptide encoded by the parent nucleic acid sequence. Also provided is a synthetic (including a further synthetic) nucleic acid molecule prepared by the methods of the invention.
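The step of decreasing transcription regulatory sequences without changing the encoded amino acids can be sketched, under stated assumptions, as a greedy search over synonymous codons. The function names are hypothetical, the greedy strategy is an illustrative choice rather than the procedure of the specification, and the single motif used (AATAAA, the canonical polyadenylation signal hexamer) stands in for whatever collection of regulatory sequence patterns a practitioner would actually screen against.

```python
# Standard genetic code, built in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in GENETIC_CODE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def count_motifs(seq, motifs):
    """Count every (possibly overlapping) occurrence of each motif in seq."""
    return sum(
        1
        for m in motifs
        for i in range(len(seq) - len(m) + 1)
        if seq[i:i + len(m)] == m
    )


def remove_motifs(cds, motifs):
    """Greedily swap synonymous codons whenever a swap lowers the motif count."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    best = count_motifs(cds, motifs)
    for idx in range(len(codons)):
        if best == 0:
            break
        aa = GENETIC_CODE[codons[idx]]
        if aa == "*":
            continue  # leave stop codons unchanged in this sketch
        for alt in SYNONYMS[aa]:
            trial = codons[:idx] + [alt] + codons[idx + 1:]
            score = count_motifs("".join(trial), motifs)
            if score < best:
                codons, best = trial, score
                break
    return "".join(codons)


parent = "ATGAATAAAGGCTAA"                  # contains AATAAA at codons 2-3
cleaned = remove_motifs(parent, ["AATAAA"])  # AAT (Asn) becomes AAC (Asn)
```

Because a swap is accepted only when the total motif count drops, a substitution that removes one site but creates another is rejected, which mirrors the stated preference that altered codons not increase the number of transcriptional regulatory sequences.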

As described hereinbelow, the methods of the invention were employed with click beetle luciferase and Renilla luciferase nucleic acid sequences. While both of these nucleic acid molecules encode luciferase proteins, they are from entirely different families and are widely separated evolutionarily. These proteins have unrelated amino acid sequences and protein structures, and they utilize dissimilar chemical substrates. The fact that they share the name “luciferase” should not be interpreted to mean that they are from the same family, or even largely similar families. The methods produced synthetic luciferase nucleic acid molecules which exhibited significantly enhanced levels of mammalian expression without negatively affecting other desirable physical or biochemical properties (including protein half-life) and which were also largely devoid of known transcription regulatory elements.

The invention also provides at least two synthetic nucleic acid molecules that encode highly related polypeptides, but which synthetic nucleic acid molecules have an increased number of nucleotide differences relative to each other. These differences decrease the recombination frequency between the two synthetic nucleic acid molecules when those molecules are both present in a cell (i.e., they are “codon distinct” versions of a synthetic nucleic acid molecule). Thus, the invention provides a method for preparing at least two synthetic nucleic acid molecules that are codon distinct versions of a parent nucleic acid sequence that encodes a polypeptide. The method comprises altering a parent nucleic acid sequence to yield a first synthetic nucleic acid molecule having an increased number of a first plurality of codons that are employed more frequently in a selected host cell relative to the number of those codons present in the parent nucleic acid sequence. Optionally, the first synthetic nucleic acid molecule also has a decreased number of transcription regulatory sequences relative to the parent nucleic acid sequence. The parent nucleic acid sequence is also altered to yield a second synthetic nucleic acid molecule having an increased number of a second plurality of codons that are employed more frequently in the host cell relative to the number of those codons in the parent nucleic acid sequence, wherein the first plurality of codons is different than the second plurality of codons, and wherein the first and the second synthetic nucleic acid molecules preferably encode the same polypeptide. Optionally, the second synthetic nucleic acid molecule has a decreased number of transcription regulatory sequences relative to the parent nucleic acid sequence. Either or both synthetic molecules can then be further modified.

Clearly, the present invention has applications with many genes and across many fields of science including, but not limited to, life science research, agrigenetics, genetic therapy, developmental science and pharmaceutical development.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1. Codons and their corresponding amino acids.

FIG. 2. A nucleotide sequence comparison of a yellow-green (YG) click beetle luciferase nucleic acid sequence (YG#81-6G01; SEQ ID NO:2) and various synthetic green (GR) click beetle luciferase nucleic acid sequences (GRver1, SEQ ID NO:3; GRver2, SEQ ID NO:4; GRver3, SEQ ID NO:5; GRver4, SEQ ID NO:6; GRver5, SEQ ID NO:7; GR6, SEQ ID NO:8; GRver5.1, SEQ ID NO:9) and various red (RD) click beetle luciferase nucleic acid sequences (RDver1, SEQ ID NO:10; RDver2, SEQ ID NO:11; RDver3, SEQ ID NO:12; RDver4, SEQ ID NO:13; RDver5, SEQ ID NO:14; RD7, SEQ ID NO:15; RDver5.1, SEQ ID NO:16; RDver5.2, SEQ ID NO:17; RD156-1H9, SEQ ID NO:18). The nucleotides enclosed in boxes are nucleotides that differ from the nucleotide present at the homologous position in SEQ ID NO:2.

FIG. 3. An amino acid sequence comparison of a YG click beetle luciferase amino acid sequence (YG#81-6G01, SEQ ID NO:24) and various synthetic GR click beetle luciferase amino acid sequences (GRver1, SEQ ID NO:25; GRver2, SEQ ID NO:26; GRver3, SEQ ID NO:27; GRver4, SEQ ID NO:28; GRver5, SEQ ID NO:29; GR6, SEQ ID NO:30; GRver5.1, SEQ ID NO:31) and various red (RD) click beetle luciferase amino acid sequences (RDver1, SEQ ID NO:32; RDver2, SEQ ID NO:33; RDver3, SEQ ID NO:34; RDver4, SEQ ID NO:218; RDver5, SEQ ID NO:219; RD7, SEQ ID NO:220; RDver5.1, SEQ ID NO:221; RDver5.2, SEQ ID NO:222; RD156-1H9, SEQ ID NO:223). All amino acid sequences are inferred from the corresponding nucleotide sequence. The amino acids enclosed in boxes are amino acids that differ from the amino acid present at the homologous position in SEQ ID NO:24.

FIG. 4. Codon usage in YG#81-6G01, GRver1, RDver1, GRver5, RDver5, and humans (HUM) and relative codon usage in YG#81-6G01, GRver5, RDver5, and humans.

FIG. 5. Codon usage summaries for YG#81-6G01 (FIG. 5A), and GR/RD synthetic nucleic acid sequences, GRver1 (FIG. 5B), RDver1 (FIG. 5C), GRver2 (FIG. 5D), RDver2 (FIG. 5E), GRver3 (FIG. 5F), RDver3 (FIG. 5G), GRver4 (FIG. 5H), RDver4 (FIG. 5I), GRver5 (FIG. 5J), RDver5 (FIG. 5K).

FIG. 6. Oligonucleotides employed to prepare synthetic GR/RD luciferase genes (SEQ ID Nos. 35-245).

FIG. 7. A nucleotide sequence comparison of a wild type Renilla reniformis luciferase nucleic acid sequence, Genbank Accession No. M63501 (RELLUC, SEQ ID NO:19), and various synthetic Renilla luciferase nucleic acid sequences (Rlucver1, SEQ ID NO:20; Rlucver2, SEQ ID NO:21; Rluc-final, SEQ ID NO:22). The nucleotides enclosed in boxes are nucleotides that differ from the nucleotide present at the homologous position in SEQ ID NO:19.

FIG. 8. An amino acid sequence comparison of a wild type Renilla reniformis luciferase amino acid sequence (RELLUC, SEQ ID NO:224) and various synthetic Renilla reniformis luciferase amino acid sequences (Rlucver1, SEQ ID NO:225; Rlucver2, SEQ ID NO:226; Rluc-final, SEQ ID NO:227). All amino acid sequences are inferred from the corresponding nucleotide sequence. The amino acids enclosed in boxes are amino acids that differ from the amino acid present at the homologous position in SEQ ID NO:224.

FIG. 9. Codon usage in wild-type (A) versus synthetic (B) Renilla luciferase genes. For codon usage in selected organisms, see, e.g., Wada et al., 1990; Sharp et al., 1988; Aota et al., 1988; and Sharp et al., 1987, and for plant codons, Murray et al., 1989.

FIG. 10. Oligonucleotides employed to prepare synthetic Renilla luciferase gene (SEQ ID Nos. 246-292).

FIG. 11. A nucleotide sequence comparison of a wild type yellow-green (YG) click beetle luciferase nucleic acid sequence (LUCPPLYG, SEQ ID NO:1) and the synthetic green click beetle luciferase nucleic acid sequences (GRver5.1, SEQ ID NO:9) and the synthetic red click beetle luciferase nucleic acid sequences (RD156-1H9, SEQ ID NO:18). The nucleotides enclosed in boxes are nucleotides that differ from the nucleotide present at the homologous position in SEQ ID NO:1. Both synthetic sequences have a codon composition that differs from LUCPPLYG at more than 25% of the codons and have at least 3-fold fewer transcription regulatory sequences relative to a random selection of codons at the codons which differ.

FIG. 12. An amino acid sequence comparison of a wild type YG click beetle luciferase amino acid sequence (LUCPPLYG, SEQ ID NO:23) and the synthetic GR click beetle luciferase amino acid sequences (GRver5.1, SEQ ID NO:31) and the red (RD) click beetle luciferase amino acid sequences (RD156-1H9, SEQ ID NO:223). All amino acid sequences are inferred from the corresponding nucleotide sequence. The amino acids enclosed in boxes are amino acids that differ from the amino acid present at the homologous position in SEQ ID NO:23.

FIG. 13. pRL vector series. All of the vectors contain the Renilla wild type or synthetic gene as further described herein. FIG. 13A illustrates the Renilla luciferase gene in the pGL3 vectors (Promega Corp.). FIG. 13B illustrates the Renilla luciferase co-reporter vector series. pRL-TK has the herpes simplex virus (HSV) tk promoter; pRL-SV40 has the SV40 virus early enhancer/promoter; pRL-CMV has the cytomegalovirus (CMV) enhancer and immediate early promoter; pRL-null has MCS (multiple cloning sites) but no promoter or enhancer; pRL-TK(Int⁻) has HSV/tk promoter without an intron that is present in the other plasmids; pR-GL3B has the pGL-3 Basic backbone (Promega Corp.); pR-GL3 TK has the pGL3-Basic backbone with an HSV tk promoter.

FIG. 14. Half-life of synthetic (Rluc-final) and native Renilla luciferases in CHO cells.

FIGS. 15A-B. In vitro transcription/translation of Renilla luciferase nucleic acid sequences. A) t=0-60 minutes; B) linear range.

FIGS. 15C-D. In vitro translation of native and synthetic (Rluc-final) Renilla luciferase RNAs in a rabbit reticulocyte lysate. RNA was quantitated and the same amount was employed as in the translation reaction shown in FIGS. 15A-B. C) t=0-60 minutes; D) linear range.

FIGS. 15E-F. Translation of native and synthetic (Rluc-final) Renilla RNAs in a wheat germ extract. E) t=0-60 minutes; F) linear range.

FIG. 16. High expression from a synthetic Renilla nucleic acid sequence reduces the risk of promoter interference in a co-transfection assay. CHO cells were co-transfected with a constant amount (50 ng) of firefly luciferase expression vector (pGL3 control vector, with SV40 promoter and enhancer; Luc+) and a pRL vector having a native (0 ng, 50 ng, 100 ng, 500 ng, 1 μg or 2 μg) or synthetic (0 ng, 5 ng, 10 ng, 50 ng, 100 ng or 200 ng) Renilla luciferase gene.

FIGS. 17A-B. Illustration of the reactions catalyzed by firefly and click beetle (17A), and Renilla (17B) luciferases.

FIG. 18. Nucleotide and inferred amino acid sequence of click beetle luciferases in pGL3 vectors (GRver5.1 in pGL3, SEQ ID NO:297 encoding SEQ ID NO:298; RDver5.1 in pGL3, SEQ ID NO:299 encoding SEQ ID NO:300; and RD156-1H9 in pGL3, SEQ ID NO:301 encoding SEQ ID NO:302). To clone GRver5.1, RDver5.1, and RD156-1H9 nucleic acid sequences into pGL3 vectors, an oligonucleotide having an Nco I site at the initiation codon was employed, which resulted in an amino acid substitution at position 2 to valine.

DETAILED DESCRIPTION OF THE INVENTION

Definitions

The term “gene” as used herein, refers to a DNA sequence that comprises coding sequences necessary for the production of a polypeptide or protein precursor.

The polypeptide can be encoded by a full length coding sequence or by any portion of the coding sequence, as long as the desired protein activity is retained.

A “nucleic acid”, as used herein, is a covalently linked sequence of nucleotides in which the 3′ position of the pentose of one nucleotide is joined by a phosphodiester group to the 5′ position of the pentose of the next, and in which the nucleotide residues (bases) are linked in specific sequence, i.e., a linear order of nucleotides. A “polynucleotide”, as used herein, is a nucleic acid containing a sequence that is greater than about 100 nucleotides in length. An “oligonucleotide”, as used herein, is a short polynucleotide or a portion of a polynucleotide. An oligonucleotide typically contains a sequence of about two to about one hundred bases. The word “oligo” is sometimes used in place of the word “oligonucleotide”.

Nucleic acid molecules are said to have a “5′-terminus” (5′ end) and a “3′-terminus” (3′ end) because nucleic acid phosphodiester linkages occur to the 5′ carbon and 3′ carbon of the pentose ring of the substituent mononucleotides. The end of a polynucleotide at which a new linkage would be to a 5′ carbon is its 5′ terminal nucleotide. The end of a polynucleotide at which a new linkage would be to a 3′ carbon is its 3′ terminal nucleotide. A terminal nucleotide, as used herein, is the nucleotide at the end position of the 3′- or 5′-terminus.

DNA molecules are said to have “5′ ends” and “3′ ends” because mononucleotides are reacted to make oligonucleotides in a manner such that the 5′ phosphate of one mononucleotide pentose ring is attached to the 3′ oxygen of its neighbor in one direction via a phosphodiester linkage. Therefore, an end of an oligonucleotide is referred to as the “5′ end” if its 5′ phosphate is not linked to the 3′ oxygen of a mononucleotide pentose ring and as the “3′ end” if its 3′ oxygen is not linked to a 5′ phosphate of a subsequent mononucleotide pentose ring.

As used herein, a nucleic acid sequence, even if internal to a larger oligonucleotide or polynucleotide, also may be said to have 5′ and 3′ ends. In either a linear or circular DNA molecule, discrete elements are referred to as being “upstream” or 5′ of the “downstream” or 3′ elements. This terminology reflects the fact that transcription proceeds in a 5′ to 3′ fashion along the DNA strand. Typically, promoter and enhancer elements that direct transcription of a linked gene are located 5′ or upstream of the coding region. However, enhancer elements can exert their effect even when located 3′ of the promoter element and the coding region. Transcription termination and polyadenylation signals are located 3′ or downstream of the coding region.

The term “codon” as used herein, is a basic genetic coding unit, consisting of a sequence of three nucleotides that specify a particular amino acid to be incorporated into a polypeptide chain, or a start or stop signal. FIG. 1 contains a codon table. The term “coding region” when used in reference to a structural gene refers to the nucleotide sequences that encode the amino acids found in the nascent polypeptide as a result of translation of a mRNA molecule. Typically, the coding region is bounded on the 5′ side by the nucleotide triplet “ATG” which encodes the initiator methionine and on the 3′ side by a stop codon (e.g., TAA, TAG, TGA). In some cases the coding region is also known to initiate by a nucleotide triplet “TTG”.
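The definition of a coding region, bounded by an initiator ATG and an in-frame stop codon, can be illustrated with a short Python sketch. The function names are hypothetical, only the first ATG is considered, and the alternative TTG start mentioned above is deliberately not handled; these are simplifying assumptions.

```python
# Standard genetic code, built in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def coding_region(dna):
    """Return the span from the first ATG through the first in-frame stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(dna) - 2, 3):
        if GENETIC_CODE[dna[i:i + 3]] == "*":
            return dna[start:i + 3]
    return None  # no in-frame stop codon found


def translate(cds):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = GENETIC_CODE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


cds = coding_region("CCATGGCCAAGTAAGG")   # ATG GCC AAG TAA
protein = translate(cds)                   # Met-Ala-Lys
```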

By “protein” and “polypeptide” is meant any chain of amino acids, regardless of length or post-translational modification (e.g., glycosylation or phosphorylation). The synthetic genes of the invention may also encode a variant of a naturally-occurring protein or polypeptide fragment thereof. Preferably, such a protein or polypeptide has an amino acid sequence that is at least 85%, preferably 90%, and most preferably 95% or 99% identical to the amino acid sequence of the naturally-occurring (native) protein from which it is derived.

Polypeptide molecules are said to have an “amino terminus” (N-terminus) and a “carboxy terminus” (C-terminus) because peptide linkages occur between the backbone amino group of a first amino acid residue and the backbone carboxyl group of a second amino acid residue. The terms “N-terminal” and “C-terminal” in reference to polypeptide sequences refer to regions of polypeptides including portions of the N-terminal and C-terminal regions of the polypeptide, respectively. A sequence that includes a portion of the N-terminal region of a polypeptide includes amino acids predominantly from the N-terminal half of the polypeptide chain, but is not limited to such sequences. For example, an N-terminal sequence may include an interior portion of the polypeptide sequence including amino acids from both the N-terminal and C-terminal halves of the polypeptide. The same applies to C-terminal regions. N-terminal and C-terminal regions may, but need not, include the amino acid defining the ultimate N-terminus and C-terminus of the polypeptide, respectively.

The term “wild type” as used herein, refers to a gene or gene productthat has the characteristics of that gene or gene product isolated froma naturally occurring source. A wild type gene is that which is mostfrequently observed in a population and is thus arbitrarily designatedthe “wild type” form of the gene. In contrast, the term “mutant” refersto a gene or gene product that displays modifications in sequence and/orfunctional properties (i.e., altered characteristics) when compared tothe wild type gene or gene product. It is noted that naturally-occurringmutants can be isolated; these are identified by the fact that they havealtered characteristics when compared to the wild type gene or geneproduct.

The terms “complementary” or “complementarity” are used in reference to a sequence of nucleotides related by the base-pairing rules. For example, the sequence 5′ “A-G-T” 3′ is complementary to the sequence 3′ “T-C-A” 5′. Complementarity may be “partial,” in which only some of the nucleic acids' bases are matched according to the base pairing rules. Or, there may be “complete” or “total” complementarity between the nucleic acids. The degree of complementarity between nucleic acid strands has significant effects on the efficiency and strength of hybridization between nucleic acid strands. This is of particular importance in amplification reactions, as well as detection methods which depend upon hybridization of nucleic acids.
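The base-pairing rules described above can be illustrated with a minimal Python sketch. The helper names `reverse_complement` and `fraction_complementary` are hypothetical; the sketch assumes DNA strands written 5′ to 3′, of equal length, paired antiparallel without gaps, and it counts only Watson-Crick A:T and G:C pairs.

```python
# Illustrative sketch only; names are hypothetical, not from the specification.
_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the fully complementary strand, written 5' to 3'."""
    return "".join(_PAIR[base] for base in reversed(seq.upper()))


def fraction_complementary(strand_a, strand_b):
    """Fraction of positions that base pair when two equal-length strands,
    both written 5' to 3', are aligned antiparallel without gaps."""
    if not strand_a or len(strand_a) != len(strand_b):
        raise ValueError("strands must be non-empty and of equal length")
    b_reversed = strand_b.upper()[::-1]  # read strand_b 3' to 5'
    pairs = sum(1 for x, y in zip(strand_a.upper(), b_reversed)
                if _PAIR.get(x) == y)
    return pairs / len(strand_a)
```

Under these assumptions, 5′ AGT 3′ and 5′ ACT 3′ (that is, 3′ TCA 5′) are completely complementary, while 5′ AGT 3′ and 5′ ACC 3′ are only partially complementary, pairing at two of three positions.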

The term “recombinant protein” or “recombinant polypeptide” as usedherein refers to a protein molecule expressed from a recombinant DNAmolecule. In contrast, the term “native protein” is used herein toindicate a protein isolated from a naturally occurring (i.e., anonrecombinant) source. Molecular biological techniques may be used toproduce a recombinant form of a protein with identical properties ascompared to the native form of the protein.

The terms “fusion protein” and “fusion partner” refer to a chimericprotein containing the protein of interest (e.g., luciferase) joined toan exogenous protein fragment (e.g., a fusion partner which consists ofa non-luciferase protein). The fusion partner may enhance the solubilityof protein as expressed in a host cell, may, for example, provide anaffinity tag to allow purification of the recombinant fusion proteinfrom the host cell or culture supernatant, or both. If desired, thefusion partner may be removed from the protein of interest by a varietyof enzymatic or chemical means known to the art.

The terms “cell,” “cell line,” “host cell,” as used herein, are usedinterchangeably, and all such designations include progeny or potentialprogeny of these designations. By “transformed cell” is meant a cellinto which (or into an ancestor of which) has been introduced a DNAmolecule comprising a synthetic gene. Optionally, a synthetic gene ofthe invention may be introduced into a suitable cell line so as tocreate a stably-transfected cell line capable of producing the proteinor polypeptide encoded by the synthetic gene. Vectors, cells, andmethods for constructing such cell lines are well known in the art, e.g.in Ausubel, et al. (infra). The words “transformants” or “transformedcells” include the primary transformed cells derived from the originallytransformed cell without regard to the number of transfers. All progenymay not be precisely identical in DNA content, due to deliberate orinadvertent mutations. Nonetheless, mutant progeny that have the samefunctionality as screened for in the originally transformed cell areincluded in the definition of transformants.

Nucleic acids are known to contain different types of mutations. A “point” mutation refers to an alteration in the sequence of a nucleotide at a single base position from the wild type sequence. Mutations may also refer to insertion or deletion of one or more bases, so that the nucleic acid sequence differs from the wild-type sequence.

The term “homology” refers to a degree of complementarity. There may be partial homology or complete homology (i.e., identity). Homology is often measured using sequence analysis software (e.g., Sequence Analysis Software Package of the Genetics Computer Group, University of Wisconsin Biotechnology Center, 1710 University Avenue, Madison, Wis. 53705). Such software matches similar sequences by assigning degrees of homology to various substitutions, deletions, insertions, and other modifications. Conservative substitutions typically include substitutions within the following groups: glycine, alanine; valine, isoleucine, leucine; aspartic acid, glutamic acid, asparagine, glutamine; serine, threonine; lysine, arginine; and phenylalanine, tyrosine.

A “partially complementary” sequence is one that at least partially inhibits a completely complementary sequence from hybridizing to a target nucleic acid; such a sequence is referred to using the functional term “substantially homologous.” The inhibition of hybridization of the completely complementary sequence to the target sequence may be examined using a hybridization assay (Southern or Northern blot, solution hybridization and the like) under conditions of low stringency. A substantially homologous sequence or probe will compete for and inhibit the binding (i.e., the hybridization) of a completely homologous sequence to a target under conditions of low stringency. This is not to say that conditions of low stringency are such that non-specific binding is permitted; low stringency conditions require that the binding of two sequences to one another be a specific (i.e., selective) interaction. The absence of non-specific binding may be tested by the use of a second target which lacks even a partial degree of complementarity (e.g., less than about 30% identity). In this case, in the absence of non-specific binding, the probe will not hybridize to the second non-complementary target.

When used in reference to a double-stranded nucleic acid sequence such as a cDNA or a genomic clone, the term “substantially homologous” refers to any probe which can hybridize to either or both strands of the double-stranded nucleic acid sequence under conditions of low stringency as described herein.

“Probe” refers to an oligonucleotide designed to be sufficiently complementary to a sequence in a denatured nucleic acid to be probed (in relation to its length) to be bound under selected stringency conditions.

“Hybridization” and “binding” in the context of probes and denatured (melted) nucleic acid are used interchangeably. Probes which are hybridized or bound to denatured nucleic acid are base paired to complementary sequences in the polynucleotide. Whether or not a particular probe remains base paired with the polynucleotide depends on the degree of complementarity, the length of the probe, and the stringency of the binding conditions. The higher the stringency, the higher must be the degree of complementarity and/or the longer the probe.

The term “hybridization” is used in reference to the pairing of complementary nucleic acid strands. Hybridization and the strength of hybridization (i.e., the strength of the association between nucleic acid strands) are impacted by many factors well known in the art, including the degree of complementarity between the nucleic acids, the stringency of the conditions involved (affected by such conditions as the concentration of salts), the T_(m) (melting temperature) of the formed hybrid, the presence of other components (e.g., the presence or absence of polyethylene glycol), the molarity of the hybridizing strands and the G:C content of the nucleic acid strands.

The term “stringency” is used in reference to the conditions of temperature, ionic strength, and the presence of other compounds, under which nucleic acid hybridizations are conducted. With “high stringency” conditions, nucleic acid base pairing will occur only between nucleic acid fragments that have a high frequency of complementary base sequences. Thus, conditions of “medium” or “low” stringency are often required when it is desired that nucleic acids which are not completely complementary to one another be hybridized or annealed together. The art knows well that numerous equivalent conditions can be employed to comprise medium or low stringency conditions. The choice of hybridization conditions is generally evident to one skilled in the art and is usually guided by the purpose of the hybridization, the type of hybridization (DNA-DNA or DNA-RNA), and the level of desired relatedness between the sequences (e.g., Sambrook et al., 1989; Nucleic Acid Hybridization, A Practical Approach, IRL Press, Washington D.C., 1985, for a general discussion of the methods).

The stability of nucleic acid duplexes is known to decrease with anincreased number of mismatched bases, and further to be decreased to agreater or lesser degree depending on the relative positions ofmismatches in the hybrid duplexes. Thus, the stringency of hybridizationcan be used to maximize or minimize stability of such duplexes.Hybridization stringency can be altered by: adjusting the temperature ofhybridization; adjusting the percentage of helix destabilizing agents,such as formamide, in the hybridization mix; and adjusting thetemperature and/or salt concentration of the wash solutions. For filterhybridizations, the final stringency of hybridizations often isdetermined by the salt concentration and/or temperature used for thepost-hybridization washes.

“High stringency conditions” when used in reference to nucleic acidhybridization comprise conditions equivalent to binding or hybridizationat 42° C. in a solution consisting of 5×SSPE (43.8 g/l NaCl, 6.9 g/lNaH₂PO₄ H₂O and 1.85 g/l EDTA, pH adjusted to 7.4 with NaOH), 0.5% SDS,5× Denhardt's reagent and 100 μg/ml denatured salmon sperm DNA followedby washing in a solution comprising 0.1×SSPE, 1.0% SDS at 42° C. when aprobe of about 500 nucleotides in length is employed.

“Medium stringency conditions” when used in reference to nucleic acid hybridization comprise conditions equivalent to binding or hybridization at 42° C. in a solution consisting of 5×SSPE (43.8 g/l NaCl, 6.9 g/l NaH₂PO₄ H₂O and 1.85 g/l EDTA, pH adjusted to 7.4 with NaOH), 0.5% SDS, 5× Denhardt's reagent and 100 μg/ml denatured salmon sperm DNA followed by washing in a solution comprising 1.0×SSPE, 1.0% SDS at 42° C. when a probe of about 500 nucleotides in length is employed.

“Low stringency conditions” comprise conditions equivalent to binding or hybridization at 42° C. in a solution consisting of 5×SSPE (43.8 g/l NaCl, 6.9 g/l NaH₂PO₄ H₂O and 1.85 g/l EDTA, pH adjusted to 7.4 with NaOH), 0.1% SDS, 5× Denhardt's reagent [50× Denhardt's contains per 500 ml: 5 g Ficoll (Type 400, Pharmacia), 5 g BSA (Fraction V; Sigma)] and 100 μg/ml denatured salmon sperm DNA followed by washing in a solution comprising 5×SSPE, 0.1% SDS at 42° C. when a probe of about 500 nucleotides in length is employed.

The term “T_(m)” is used in reference to the “melting temperature”. The melting temperature is the temperature at which 50% of a population of double-stranded nucleic acid molecules becomes dissociated into single strands. The equation for calculating the T_(m) of nucleic acids is well-known in the art. The T_(m) of a hybrid nucleic acid is often estimated using a formula adopted from hybridization assays in 1 M salt, and commonly used for calculating T_(m) for PCR primers: [(number of A+T)×2° C.+(number of G+C)×4° C.] (C. R. Newton et al., PCR, 2nd Ed., Springer-Verlag (New York, 1997), p. 24). This formula was found to be inaccurate for primers longer than 20 nucleotides. (Id.) Another simple estimate of the T_(m) value may be calculated by the equation: T_(m)=81.5+0.41(% G+C), when a nucleic acid is in aqueous solution at 1 M NaCl (e.g., Anderson and Young, Quantitative Filter Hybridization, in Nucleic Acid Hybridization, 1985). Other more sophisticated computations exist in the art which take structural as well as sequence characteristics into account for the calculation of T_(m). A calculated T_(m) is merely an estimate; the optimum temperature is commonly determined empirically.
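The two simple T_(m) estimates given above can be written out as a short Python sketch. The function names `tm_primer_rule` and `tm_gc_percent` are hypothetical labels for the [(A+T)×2° C.+(G+C)×4° C.] formula and the 81.5+0.41(% G+C) formula, respectively; both return degrees Celsius and, as noted above, give estimates only.

```python
# Illustrative sketch only; names are hypothetical, not from the specification.

def tm_primer_rule(seq):
    """T_m estimate: (number of A+T) x 2 C + (number of G+C) x 4 C.

    Per the text above, this is unreliable for primers longer than
    about 20 nucleotides.
    """
    s = seq.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2 * at + 4 * gc


def tm_gc_percent(seq):
    """T_m estimate: 81.5 + 0.41 x (% G+C), for aqueous solution at 1 M NaCl."""
    s = seq.upper()
    if not s:
        raise ValueError("sequence must not be empty")
    gc = s.count("G") + s.count("C")
    return 81.5 + 0.41 * (100.0 * gc / len(s))
```

For a 20-nucleotide sequence containing ten A or T bases and ten G or C bases, the first formula gives 60° C. and the second gives about 102° C., which illustrates how far such estimates can diverge and why the optimum temperature is commonly determined empirically.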

The term “isolated” when used in relation to a nucleic acid, as in “isolated oligonucleotide” or “isolated polynucleotide” refers to a nucleic acid sequence that is identified and separated from at least one contaminant with which it is ordinarily associated in its source. Thus, an isolated nucleic acid is present in a form or setting that is different from that in which it is found in nature. In contrast, non-isolated nucleic acids (e.g., DNA and RNA) are found in the state they exist in nature. For example, a given DNA sequence (e.g., a gene) is found on the host cell chromosome in proximity to neighboring genes; RNA sequences (e.g., a specific mRNA sequence encoding a specific protein), are found in the cell as a mixture with numerous other mRNAs that encode a multitude of proteins. However, isolated nucleic acid includes, by way of example, such nucleic acid in cells ordinarily expressing that nucleic acid where the nucleic acid is in a chromosomal location different from that of natural cells, or is otherwise flanked by a different nucleic acid sequence than that found in nature. The isolated nucleic acid or oligonucleotide may be present in single-stranded or double-stranded form. When an isolated nucleic acid or oligonucleotide is to be utilized to express a protein, the oligonucleotide contains at a minimum, the sense or coding strand (i.e., the oligonucleotide may be single-stranded), but may contain both the sense and anti-sense strands (i.e., the oligonucleotide may be double-stranded).

The term “isolated” when used in relation to a polypeptide, as in“isolated protein” or “isolated polypeptide” refers to a polypeptidethat is identified and separated from at least one contaminant withwhich it is ordinarily associated in its source. Thus, an isolatedpolypeptide is present in a form or setting that is different from thatin which it is found in nature. In contrast, non-isolated polypeptides(e.g., proteins and enzymes) are found in the state they exist innature.

The term “purified” or “to purify” means the result of any process that removes some of a contaminant from the component of interest, such as a protein or nucleic acid. The percent of a purified component is thereby increased in the sample.

The term “operably linked” as used herein refers to the linkage of nucleic acid sequences in such a manner that a nucleic acid molecule capable of directing the transcription of a given gene and/or the synthesis of a desired protein molecule is produced. The term also refers to the linkage of sequences encoding amino acids in such a manner that a functional (e.g., enzymatically active, capable of binding to a binding partner, capable of inhibiting, etc.) protein or polypeptide is produced.

The term “recombinant DNA molecule” means a hybrid DNA sequence comprising at least two nucleotide sequences not normally found together in nature.

The term “vector” is used in reference to nucleic acid molecules into which fragments of DNA may be inserted or cloned, which can be used to transfer DNA segment(s) into a cell, and which are capable of replication in a cell. Vectors may be derived from plasmids, bacteriophages, viruses, cosmids, and the like.

The terms “recombinant vector” and “expression vector” as used herein refer to DNA or RNA sequences containing a desired coding sequence and appropriate DNA or RNA sequences necessary for the expression of the operably linked coding sequence in a particular host organism. Prokaryotic expression vectors include a promoter, a ribosome binding site, an origin of replication for autonomous replication in a host cell and possibly other sequences, e.g., an optional operator sequence and optional restriction enzyme sites. A promoter is defined as a DNA sequence that directs RNA polymerase to bind to DNA and to initiate RNA synthesis. Eukaryotic expression vectors include a promoter, optionally a polyadenylation signal and optionally an enhancer sequence.

The term “a polynucleotide having a nucleotide sequence encoding agene,” means a nucleic acid sequence comprising the coding region of agene, or in other words the nucleic acid sequence which encodes a geneproduct. The coding region may be present in either a cDNA, genomic DNAor RNA form. When present in a DNA form, the oligonucleotide may besingle-stranded (i.e., the sense strand) or double-stranded. Suitablecontrol elements such as enhancers/promoters, splice junctions,polyadenylation signals, etc. may be placed in close proximity to thecoding region of the gene if needed to permit proper initiation oftranscription and/or correct processing of the primary RNA transcript.Alternatively, the coding region utilized in the expression vectors ofthe present invention may contain endogenous enhancers/promoters, splicejunctions, intervening sequences, polyadenylation signals, etc. Infurther embodiments, the coding region may contain a combination of bothendogenous and exogenous control elements.

The term “transcription regulatory element” or “transcription regulatorysequence” refers to a genetic element or sequence that controls someaspect of the expression of nucleic acid sequence(s). For example, apromoter is a regulatory element that facilitates the initiation oftranscription of an operably linked coding region. Other regulatoryelements include, but are not limited to, transcription factor bindingsites, splicing signals, polyadenylation signals, termination signalsand enhancer elements.

Transcriptional control signals in eukaryotes comprise “promoter” and “enhancer” elements. Promoters and enhancers consist of short arrays of DNA sequences that interact specifically with cellular proteins involved in transcription (Maniatis et al., 1987). Promoter and enhancer elements have been isolated from a variety of eukaryotic sources including genes in yeast, insect and mammalian cells. Promoter and enhancer elements have also been isolated from viruses and analogous control elements, such as promoters, are also found in prokaryotes. The selection of a particular promoter and enhancer depends on the cell type used to express the protein of interest. Some eukaryotic promoters and enhancers have a broad host range while others are functional in a limited subset of cell types (for review, see Voss et al., 1986; and Maniatis et al., 1987). For example, the SV40 early gene enhancer is very active in a wide variety of cell types from many mammalian species and has been widely used for the expression of proteins in mammalian cells (Dijkema et al., 1985). Other examples of promoter/enhancer elements active in a broad range of mammalian cell types are those from the human elongation factor 1 gene (Uetsuki et al., 1989; Kim, et al., 1990; and Mizushima and Nagata, 1990), the long terminal repeats of the Rous sarcoma virus (Gorman et al., 1982), and the human cytomegalovirus (Boshart et al., 1985).

The term “promoter/enhancer” denotes a segment of DNA containingsequences capable of providing both promoter and enhancer functions(i.e., the functions provided by a promoter element and an enhancerelement as described above). For example, the long terminal repeats ofretroviruses contain both promoter and enhancer functions. Theenhancer/promoter may be “endogenous” or “exogenous” or “heterologous.”An “endogenous” enhancer/promoter is one that is naturally linked with agiven gene in the genome. An “exogenous” or “heterologous”enhancer/promoter is one that is placed in juxtaposition to a gene bymeans of genetic manipulation (i.e., molecular biological techniques)such that transcription of the gene is directed by the linkedenhancer/promoter.

The presence of “splicing signals” on an expression vector often resultsin higher levels of expression of the recombinant transcript ineukaryotic host cells. Splicing signals mediate the removal of intronsfrom the primary RNA transcript and consist of a splice donor andacceptor site (Sambrook, et al., Molecular Cloning: A Laboratory Manual,2nd ed., Cold Spring Harbor Laboratory Press, New York, 1989, pp.16.7-16.8). A commonly used splice donor and acceptor site is the splicejunction from the 16S RNA of SV40.

Efficient expression of recombinant DNA sequences in eukaryotic cellsrequires expression of signals directing the efficient termination andpolyadenylation of the resulting transcript. Transcription terminationsignals are generally found downstream of the polyadenylation signal andare a few hundred nucleotides in length. The term “poly(A) site” or“poly(A) sequence” as used herein denotes a DNA sequence which directsboth the termination and polyadenylation of the nascent RNA transcript.Efficient polyadenylation of the recombinant transcript is desirable, astranscripts lacking a poly(A) tail are unstable and are rapidlydegraded. The poly(A) signal utilized in an expression vector may be“heterologous” or “endogenous.” An endogenous poly(A) signal is one thatis found naturally at the 3′ end of the coding region of a given gene inthe genome. A heterologous poly(A) signal is one which has been isolatedfrom one gene and positioned 3′ to another gene. A commonly usedheterologous poly(A) signal is the SV40 poly(A) signal. The SV40 poly(A)signal is contained on a 237 bp BamH I/Bcl I restriction fragment anddirects both termination and polyadenylation (Sambrook, supra, at16.6-16.7).

Eukaryotic expression vectors may also contain “viral replicons” or “viral origins of replication.” Viral replicons are viral DNA sequences which allow for the extrachromosomal replication of a vector in a host cell expressing the appropriate replication factors. Vectors containing either the SV40 or polyoma virus origin of replication replicate to high copy number (up to 10⁴ copies/cell) in cells that express the appropriate viral T antigen. In contrast, vectors containing the replicons from bovine papillomavirus or Epstein-Barr virus replicate extrachromosomally at low copy number (about 100 copies/cell).

The term “in vitro” refers to an artificial environment and to processes or reactions that occur within an artificial environment. In vitro environments include, but are not limited to, test tubes and cell lysates. The term “in situ” refers to cell culture. The term “in vivo” refers to the natural environment (e.g., an animal or a cell) and to processes or reactions that occur within a natural environment.

The term “expression system” refers to any assay or system for determining (e.g., detecting) the expression of a gene of interest. Those skilled in the field of molecular biology will understand that any of a wide variety of expression systems may be used. A wide range of suitable mammalian cells are available from a wide range of sources (e.g., the American Type Culture Collection, Rockville, Md.). The method of transformation or transfection and the choice of expression vehicle will depend on the host system selected. Transformation and transfection methods are described, e.g., in Ausubel, et al., Current Protocols in Molecular Biology, John Wiley & Sons, New York, 1992. Expression systems include in vitro gene expression assays where a gene of interest (e.g., a reporter gene) is linked to a regulatory sequence and the expression of the gene is monitored following treatment with an agent that inhibits or induces expression of the gene. Detection of gene expression can be through any suitable means including, but not limited to, detection of expressed mRNA or protein (e.g., a detectable product of a reporter gene) or through a detectable change in the phenotype of a cell expressing the gene of interest. Expression systems may also comprise assays where a cleavage event or other nucleic acid or cellular change is detected.

The term “enzyme” refers to molecules or molecule aggregates that areresponsible for catalyzing chemical and biological reactions. Suchmolecules are typically proteins, but can also comprise short peptides,RNAs, ribozymes, antibodies, and other molecules. A molecule thatcatalyzes chemical and biological reactions is referred to as “havingenzyme activity” or “having catalytic activity.”

All amino acid residues identified herein are in the natural L-configuration. In keeping with standard polypeptide nomenclature (see J. Biol. Chem., 243, 3557 (1969)), abbreviations for amino acid residues are as shown in the following Table of Correspondence.

TABLE OF CORRESPONDENCE

1-Letter   3-Letter   AMINO ACID
Y          Tyr        L-tyrosine
G          Gly        glycine
F          Phe        L-phenylalanine
M          Met        L-methionine
A          Ala        L-alanine
S          Ser        L-serine
I          Ile        L-isoleucine
L          Leu        L-leucine
T          Thr        L-threonine
V          Val        L-valine
P          Pro        L-proline
K          Lys        L-lysine
H          His        L-histidine
Q          Gln        L-glutamine
E          Glu        L-glutamic acid
W          Trp        L-tryptophan
R          Arg        L-arginine
D          Asp        L-aspartic acid
N          Asn        L-asparagine
C          Cys        L-cysteine

The term “sequence homology” means the proportion of base matchesbetween two nucleic acid sequences or the proportion of amino acidmatches between two amino acid sequences. When sequence homology isexpressed as a percentage, e.g., 50%, the percentage denotes theproportion of matches over the length of sequence from one sequence thatis compared to some other sequence. Gaps (in either of the twosequences) are permitted to maximize matching; gap lengths of 15 basesor less are usually used, 6 bases or less are preferred with 2 bases orless more preferred. When using oligonucleotides as probes ortreatments, the sequence homology between the target nucleic acid andthe oligonucleotide sequence is generally not less than 17 target basematches out of 20 possible oligonucleotide base pair matches (85%);preferably not less than 9 matches out of 10 possible base pair matches(90%), and more preferably not less than 19 matches out of 20 possiblebase pair matches (95%).

Two amino acid sequences are homologous if there is a partial or complete identity between their sequences. For example, 85% homology means that 85% of the amino acids are identical when the two sequences are aligned for maximum matching. Gaps (in either of the two sequences being matched) are allowed in maximizing matching; gap lengths of 5 or less are preferred with 2 or less being more preferred. Alternatively and preferably, two protein sequences (or polypeptide sequences derived from them of at least 100 amino acids in length) are homologous, as this term is used herein, if they have an alignment score of more than 5 (in standard deviation units) using the program ALIGN with the mutation data matrix and a gap penalty of 6 or greater. See Dayhoff, M. O., in Atlas of Protein Sequence and Structure, 1972, volume 5, National Biomedical Research Foundation, pp. 101-110, and Supplement 2 to this volume, pp. 1-10. The two sequences or parts thereof are more preferably homologous if their amino acids are greater than or equal to 85% identical when optimally aligned using the ALIGN program.

The following terms are used to describe the sequence relationshipsbetween two or more polynucleotides: “reference sequence”, “comparisonwindow”, “sequence identity”, “percentage of sequence identity”, and“substantial identity”. A “reference sequence” is a defined sequenceused as a basis for a sequence comparison; a reference sequence may be asubset of a larger sequence, for example, as a segment of a full-lengthcDNA or gene sequence given in a sequence listing, or may comprise acomplete cDNA or gene sequence. Generally, a reference sequence is atleast 20 nucleotides in length, frequently at least 25 nucleotides inlength, and often at least 50 nucleotides in length. Since twopolynucleotides may each (1) comprise a sequence (i.e., a portion of thecomplete polynucleotide sequence) that is similar between the twopolynucleotides, and (2) may further comprise a sequence that isdivergent between the two polynucleotides, sequence comparisons betweentwo (or more) polynucleotides are typically performed by comparingsequences of the two polynucleotides over a “comparison window” toidentify and compare local regions of sequence similarity.

A “comparison window”, as used herein, refers to a conceptual segment of at least 20 contiguous nucleotides wherein the portion of the polynucleotide sequence in the comparison window may comprise additions or deletions (i.e., gaps) of 20 percent or less as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences.

Methods of alignment of sequences for comparison are well known in theart. Thus, the determination of percent identity between any twosequences can be accomplished using a mathematical algorithm. Preferred,non-limiting examples of such mathematical algorithms are the algorithmof Myers and Miller (1988); the local homology algorithm of Smith andWaterman (1981); the homology alignment algorithm of Needleman andWunsch (1970); the search-for-similarity-method of Pearson and Lipman(1988); the algorithm of Karlin and Altschul (1990), modified as inKarlin and Altschul (1993).

Computer implementations of these mathematical algorithms can be utilized for comparison of sequences to determine sequence identity. Such implementations include, but are not limited to: CLUSTAL in the PC/Gene program (available from Intelligenetics, Mountain View, Calif.); the ALIGN program (Version 2.0) and GAP, BESTFIT, BLAST, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Version 8 (available from Genetics Computer Group (GCG), 575 Science Drive, Madison, Wis., USA). Alignments using these programs can be performed using the default parameters. The CLUSTAL program is well described by Higgins et al. (1988); Higgins et al. (1989); Corpet et al. (1988); Huang et al. (1992); and Pearson et al. (1994). The ALIGN program is based on the algorithm of Myers and Miller, supra. The BLAST programs of Altschul et al. (1990), are based on the algorithm of Karlin and Altschul, supra. To obtain gapped alignments for comparison purposes, Gapped BLAST (in BLAST 2.0) can be utilized as described in Altschul et al. (1997). Alternatively, PSI-BLAST (in BLAST 2.0) can be used to perform an iterated search that detects distant relationships between molecules. See Altschul et al., supra. When utilizing BLAST, Gapped BLAST, or PSI-BLAST, the default parameters of the respective programs (e.g., BLASTN for nucleotide sequences, BLASTX for proteins) can be used. See http://www.ncbi.nlm.nih.gov. Alignment may also be performed manually by inspection.

The term “sequence identity” means that two polynucleotide sequences are identical (i.e., on a nucleotide-by-nucleotide basis) over the window of comparison. The term “percentage of sequence identity” means that two polynucleotide sequences are identical (i.e., on a nucleotide-by-nucleotide basis) for the stated proportion of nucleotides over the window of comparison. The “percentage of sequence identity” is calculated by comparing two optimally aligned sequences over the window of comparison, determining the number of positions at which the identical nucleic acid base (e.g., A, T, C, G, U, or I) occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison (i.e., the window size), and multiplying the result by 100 to yield the percentage of sequence identity. The term “substantial identity” as used herein denotes a characteristic of a polynucleotide sequence, wherein the polynucleotide comprises a sequence that has at least 60%, preferably at least 65%, more preferably at least 70%, up to about 85%, and even more preferably at least 90 to 95%, more usually at least 99%, sequence identity as compared to a reference sequence over a comparison window of at least 20 nucleotide positions, frequently over a window of at least 20-50 nucleotides, and preferably at least 300 nucleotides, wherein the percentage of sequence identity is calculated by comparing the reference sequence to the polynucleotide sequence which may include deletions or additions which total 20 percent or less of the reference sequence over the window of comparison. The reference sequence may be a subset of a larger sequence.
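The calculation of the percentage of sequence identity described above can be sketched in Python as follows. The function name `percent_identity` is hypothetical, and the sketch assumes the two sequences have already been optimally aligned by one of the programs mentioned herein, so that they are of equal length with any gaps written as the character “-”. A gapped position is never counted as a match, but it does count toward the window size.

```python
# Illustrative sketch only; names are hypothetical, not from the specification.

def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of sequence identity over a comparison window.

    Inputs are two pre-aligned sequences of equal length. The number of
    matched positions is divided by the window size and multiplied by 100.
    """
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be non-empty and equal in length")
    matched = sum(1 for x, y in zip(aligned_a.upper(), aligned_b.upper())
                  if x == y and x != gap)
    return 100.0 * matched / len(aligned_a)
```

For example, two aligned 10-nucleotide windows that differ at a single position, whether by a substitution or by a one-base gap, yield 90% sequence identity under this calculation.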

As applied to polypeptides, the term “substantial identity” means that two peptide sequences, when optimally aligned, such as by the programs GAP or BESTFIT using default gap weights, share at least about 85% sequence identity, preferably at least about 90% sequence identity, more preferably at least about 95% sequence identity, and most preferably at least about 99% sequence identity.

The Synthetic Nucleic Acid Molecules and Methods of the Invention

The invention provides compositions comprising synthetic nucleic acid molecules, as well as methods for preparing those molecules which yield synthetic nucleic acid molecules that are efficiently expressed as a polypeptide or protein with desirable characteristics including reduced inappropriate or unintended transcription characteristics when expressed in a particular cell type.

Natural selection is the hypothesis that genotype-environment interactions occurring at the phenotypic level lead to differential reproductive success of individuals and hence to modification of the gene pool of a population. It is generally accepted that the amino acid sequence of a protein found in nature has undergone optimization by natural selection. However, amino acids exist within the sequence of a protein that do not contribute significantly to the activity of the protein and these amino acids can be changed to other amino acids with little or no consequence. Furthermore, a protein may be useful outside its natural environment or for purposes that differ from the conditions of its natural selection. In these circumstances, the amino acid sequence can be synthetically altered to better adapt the protein for its utility in various applications.

Likewise, the nucleic acid sequence that encodes a protein is also optimized by natural selection. The relationship between coding DNA and its transcribed RNA is such that any change to the DNA affects the resulting RNA. Thus, natural selection works on both molecules simultaneously. However, this relationship does not exist between nucleic acids and proteins. Because multiple codons encode the same amino acid, many different nucleotide sequences can encode an identical protein. A specific protein composed of 500 amino acids can theoretically be encoded by more than 10¹⁵⁰ different nucleic acid sequences.
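The order of magnitude of this estimate can be checked with a short sketch that multiplies the number of synonymous codons available at each position. The degeneracy table is that of the standard genetic code; the 500-residue test protein with equal amounts of each amino acid is a hypothetical composition chosen only for illustration.

```python
import math

# Number of codons per amino acid in the standard genetic code
# (one-letter amino acid codes; stop codons excluded).
CODON_DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def log10_synonymous_sequences(protein: str) -> float:
    """log10 of the number of distinct coding sequences for a protein.

    The count is the product of the per-residue codon degeneracies, so
    its logarithm is the sum of the per-residue logarithms.
    """
    return sum(math.log10(CODON_DEGENERACY[aa]) for aa in protein.upper())


if __name__ == "__main__":
    # Hypothetical 500-residue protein: 25 copies of each amino acid.
    protein = "ACDEFGHIKLMNPQRSTVWY" * 25
    print(f"about 10^{log10_synonymous_sequences(protein):.0f} sequences")
```

The exact exponent depends on amino acid composition, since residues such as Met and Trp contribute no redundancy while Arg, Leu and Ser contribute the most.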

Natural selection acts on nucleic acids to achieve proper encoding of the corresponding protein. Presumably, other properties of nucleic acid molecules are also acted upon by natural selection. These properties include codon usage frequency, RNA secondary structure, the efficiency of intron splicing, and interactions with transcription factors or other nucleic acid binding proteins. These other properties may alter the efficiency of protein translation and the resulting phenotype. Because of the redundant nature of the genetic code, these other attributes can be optimized by natural selection without altering the corresponding amino acid sequence.

Under some conditions, it is useful to synthetically alter the natural nucleotide sequence encoding a protein to better adapt the protein for alternative applications. A common example is to alter the codon usage frequency of a gene when it is expressed in a foreign host. Although redundancy in the genetic code allows amino acids to be encoded by multiple codons, different organisms favor some codons over others. The codon usage frequencies tend to differ most for organisms with widely separated evolutionary histories. It has been found that when transferring genes between evolutionarily distant organisms, the efficiency of protein translation can be substantially increased by adjusting the codon usage frequency (see U.S. Pat. Nos. 5,096,825, 5,670,356 and 5,874,304).

Because of the need for evolutionary distance, the codon usage of reporter genes often does not correspond to the optimal codon usage of the experimental cells. Examples include β-galactosidase (β-gal) and chloramphenicol acetyltransferase (cat) reporter genes that are derived from E. coli and are commonly used in mammalian cells; the β-glucuronidase (gus) reporter gene that is derived from E. coli and commonly used in plant cells; the firefly luciferase (luc) reporter gene that is derived from an insect and commonly used in plant and mammalian cells; and the Renilla luciferase and green fluorescent protein (gfp) reporter genes which are derived from coelenterates and are commonly used in plant and mammalian cells. To achieve sensitive quantitation of reporter gene expression, the activity of the gene product must not be endogenous to the experimental host cells. Thus, reporter genes are usually selected from organisms having unique and distinctive phenotypes. Consequently, these organisms often have widely separated evolutionary histories from the experimental host cells.

Previously, to create genes having a more optimal codon usage frequency but still encoding the same gene product, a synthetic nucleic acid sequence was made by replacing existing codons with codons that were generally more favorable to the experimental host cell (see U.S. Pat. Nos. 5,096,825, 5,670,356 and 5,874,304). The result was a net improvement in codon usage frequency of the synthetic gene. However, the optimization of other attributes was not considered and so these synthetic genes likely did not reflect genes optimized by natural selection.

In particular, improvements in codon usage frequency are intended only for optimization of an RNA sequence based on its role in translation into a protein. Thus, previously described methods did not address how the sequence of a synthetic gene affects the role of DNA in transcription into RNA. Most notably, consideration had not been given as to how transcription factors may interact with the synthetic DNA and consequently modulate or otherwise influence gene transcription. For genes found in nature, the DNA would be optimally transcribed by the native host cell and would yield an RNA that encodes a properly folded gene product. In contrast, synthetic genes have previously not been optimized for transcriptional characteristics. Rather, this property has been ignored or left to chance.

This concern is important for all genes, but particularly important for reporter genes, which are most commonly used to quantitate transcriptional behavior in the experimental host cells. Hundreds of transcription factors have been identified in different cell types under different physiological conditions, and likely more exist but have not yet been identified. All of these transcription factors can influence the transcription of an introduced gene. A useful synthetic reporter gene of the invention has a minimal risk of influencing or perturbing intrinsic transcriptional characteristics of the host cell because the structure of that gene has been altered. A particularly useful synthetic reporter gene will have desirable characteristics under a new set and/or a wide variety of experimental conditions. To best achieve these characteristics, the structure of the synthetic gene should have minimal potential for interacting with transcription factors within a broad range of host cells and physiological conditions. Minimizing potential interactions between a reporter gene and a host cell's endogenous transcription factors increases the value of a reporter gene by reducing the risk of inappropriate transcriptional characteristics of the gene within a particular experiment, increasing applicability of the gene in various environments, and increasing the acceptance of the resulting experimental data.

In contrast, a reporter gene comprising a native nucleotide sequence, based on a genomic or cDNA clone from the original host organism, may interact with transcription factors when expressed in an exogenous host. This risk stems from two circumstances. First, the native nucleotide sequence contains sequences that were optimized through natural selection to influence gene transcription within the native host organism. However, these sequences might also influence transcription when the gene is expressed in exogenous hosts, i.e., out of context, thus interfering with its performance as a reporter gene. Second, the nucleotide sequence may inadvertently interact with transcription factors that were not present in the native host organism, and thus did not participate in its natural selection. The probability of such inadvertent interactions increases with greater evolutionary separation between the experimental cells and the native organism of the reporter gene.

These potential interactions with transcription factors would likely be disrupted when using a synthetic reporter gene having alterations in codon usage frequency. However, a synthetic reporter gene sequence, designed by choosing codons based only on codon usage frequency, is likely to contain other unintended transcription factor binding sites since the synthetic gene has not been subjected to the benefit of natural selection to correct inappropriate transcriptional activities. Inadvertent interactions with transcription factors could also occur whenever the encoded amino acid sequence is artificially altered, e.g., to introduce amino acid substitutions. Similarly, these changes have not been subjected to natural selection, and thus may exhibit undesired characteristics.

Thus, the invention provides a method for preparing synthetic nucleic acid sequences that reduce the risk of undesirable interactions of the nucleic acid with transcription factors when expressed in a particular host cell, thereby reducing inappropriate or unintended transcriptional characteristics. Preferably, the method yields synthetic genes containing improved codon usage frequencies for a particular host cell and with a reduced occurrence of transcription factor binding sites. The invention also provides a method of preparing synthetic genes containing improved codon usage frequencies with a reduced occurrence of transcription factor binding sites and additional beneficial structural attributes. Such additional attributes include the absence of inappropriate RNA splicing junctions, poly(A) addition signals, undesirable restriction sites, ribosomal binding sites, and secondary structural motifs such as hairpin loops.

Also provided is a method for preparing two synthetic genes encoding the same or highly similar proteins (“codon distinct” versions). Preferably, the two synthetic genes have a reduced ability to hybridize to a common polynucleotide probe sequence, or have a reduced risk of recombining when present together in living cells. To detect recombination, PCR amplification of the reporter sequences using primers complementary to flanking sequences and sequencing of the amplified sequences may be employed.

To select codons for the synthetic nucleic acid molecules of the invention, preferred codons have a relatively high codon usage frequency in a selected host cell, and their introduction results in the introduction of relatively few transcription factor binding sites, relatively few other undesirable structural attributes, and optionally a characteristic that distinguishes the synthetic gene from another gene encoding a highly similar protein. Thus, the synthetic nucleic acid product obtained by the method of the invention is a synthetic gene with an improved level of expression due to improved codon usage frequency, a reduced risk of inappropriate transcriptional behavior due to a reduced number of undesirable transcription regulatory sequences, and optionally any additional characteristic due to other criteria that may be employed to select the synthetic sequence.

The invention may be employed with any nucleic acid sequence, e.g., a native sequence such as a cDNA or one which has been manipulated in vitro, e.g., to introduce specific alterations such as the introduction or removal of a restriction enzyme recognition site, the alteration of a codon to encode a different amino acid or to encode a fusion protein, or to alter GC or AT content (% of composition) of nucleic acid molecules. Moreover, the method of the invention is useful with any gene, but particularly useful for reporter genes as well as other genes associated with the expression of reporter genes, such as selectable markers. Preferred genes include, but are not limited to, those encoding lactamase (β-gal), neomycin resistance (Neo), CAT, GUS, galactopyranoside, GFP, xylosidase, thymidine kinase, arabinosidase and the like. As used herein, a “marker gene” or “reporter gene” is a gene that imparts a distinct phenotype to cells expressing the gene and thus permits cells having the gene to be distinguished from cells that do not have the gene. Such genes may encode either a selectable or screenable marker, depending on whether the marker confers a trait which one can “select” for by chemical means, i.e., through the use of a selective agent (e.g., a herbicide, antibiotic, or the like), or whether it is simply a “reporter” trait that one can identify through observation or testing, i.e., by “screening”. Elements of the present disclosure are exemplified in detail through the use of particular marker genes. Of course, many examples of suitable marker genes or reporter genes are known to the art and can be employed in the practice of the invention. Therefore, it will be understood that the following discussion is exemplary rather than exhaustive. In light of the techniques disclosed herein and the general recombinant techniques which are known in the art, the present invention renders possible the alteration of any gene.

Exemplary marker genes include, but are not limited to, a neo gene, a β-gal gene, a gus gene, a cat gene, a gpt gene, a hyg gene, a hisD gene, a ble gene, a mprt gene, a bar gene, a nitrilase gene, a mutant acetolactate synthase gene (ALS) or acetoacid synthase gene (AAS), a methotrexate-resistant dhfr gene, a dalapon dehalogenase gene, a mutated anthranilate synthase gene that confers resistance to 5-methyltryptophan (WO 97/26366), an R-locus gene, a β-lactamase gene, a xylE gene, an α-amylase gene, a tyrosinase gene, a luciferase (luc) gene (e.g., a Renilla reniformis luciferase gene, a firefly luciferase gene, or a click beetle luciferase (Pyrophorus plagiophthalamus) gene), an aequorin gene, or a green fluorescent protein gene. Included within the terms selectable or screenable marker genes are also genes which encode a “secretable marker” whose secretion can be detected as a means of identifying or selecting for transformed cells. Examples include markers which encode a secretable antigen that can be identified by antibody interaction, or even secretable enzymes which can be detected by their catalytic activity. Secretable proteins fall into a number of classes, including small, diffusible proteins detectable, e.g., by ELISA, and proteins that are inserted or trapped in the cell membrane.

The method of the invention can be performed by, although it is not limited to, a recursive process. The process includes assigning preferred codons to each amino acid in a target molecule, e.g., a native nucleotide sequence, based on codon usage in a particular species, identifying potential transcription regulatory sequences such as transcription factor binding sites in the nucleic acid sequence having preferred codons, e.g., using a database of such binding sites, optionally identifying other undesirable sequences, and substituting an alternative codon (i.e., encoding the same amino acid) at positions where undesirable transcription factor binding sites or other sequences occur. For codon distinct versions, alternative preferred codons are substituted in each version. If necessary, the identification and elimination of potential transcription factor or other undesirable sequences can be repeated until a nucleotide sequence is achieved containing a maximum number of preferred codons and a minimum number of undesired sequences including transcription regulatory sequences or other undesirable sequences. Also, optionally, desired sequences, e.g., restriction enzyme recognition sites, can be introduced. After a synthetic nucleic acid molecule is designed and constructed, its properties relative to the parent nucleic acid sequence can be determined by methods well known to the art. For example, the expression of the synthetic and target nucleic acid molecules in a series of vectors in a particular cell can be compared.
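The recursive process outlined above can be sketched in a few lines. This is only an illustration under stated assumptions: the small codon table, the two motifs standing in for database hits (the recognition sequences GGCGCC and CTGCAG), and all function names are hypothetical, and a real design would draw binding sites from a database such as those named in this description.

```python
# Hypothetical preferred codons per amino acid, first choice listed first.
CODON_CHOICES = {
    "M": ["ATG"],
    "A": ["GCC", "GCT"],
    "G": ["GGC", "GGT"],
    "L": ["CTG", "TTG"],
    "S": ["AGC", "TCT"],
}
# Hypothetical undesired motifs standing in for database search results.
UNDESIRED = ["GGCGCC", "CTGCAG"]


def count_hits(seq: str) -> int:
    """Count (possibly overlapping) occurrences of all undesired motifs."""
    return sum(
        1
        for motif in UNDESIRED
        for i in range(len(seq) - len(motif) + 1)
        if seq.startswith(motif, i)
    )


def first_hit(seq: str):
    """Return (position, motif) of the left-most undesired motif, or None."""
    hits = [(seq.find(m), m) for m in UNDESIRED if m in seq]
    return min(hits) if hits else None


def design(protein: str, max_passes: int = 10) -> str:
    """Assign preferred codons, then swap synonymous codons to remove motifs."""
    codons = [CODON_CHOICES[aa][0] for aa in protein]
    for _ in range(max_passes):
        seq = "".join(codons)
        hit = first_hit(seq)
        if hit is None:
            break  # no undesired sequences remain
        start, motif = hit
        improved = False
        # Codon positions overlapped by the motif.
        for idx in range(start // 3, (start + len(motif) - 1) // 3 + 1):
            for alt in CODON_CHOICES[protein[idx]]:
                trial = codons[:idx] + [alt] + codons[idx + 1:]
                # Accept a synonymous swap only if it lowers the hit count.
                if count_hits("".join(trial)) < count_hits(seq):
                    codons, improved = trial, True
                    break
            if improved:
                break
        if not improved:
            break  # no synonymous substitution helps; stop iterating
    return "".join(codons)


if __name__ == "__main__":
    print(design("MGALS"))
```

Accepting a substitution only when the total hit count decreases guards against a swap that removes one site while creating another, which is the reason the text describes the process as repeatable.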

Thus, generally, the method of the invention comprises identifying a target nucleic acid sequence, such as a vector backbone, a reporter gene or a selectable marker gene, and a host cell of interest, for example, a plant (dicot or monocot), fungus, yeast or mammalian cell. Preferred host cells are mammalian host cells such as CHO, COS, 293, Hela, CV-1 and NIH3T3 cells. Based on preferred codon usage in the host cell(s) and, optionally, low codon usage in the host cell(s), e.g., high usage mammalian codons and low usage E. coli and mammalian codons, codons to be replaced are determined. For codon distinct versions of two synthetic nucleic acid molecules, alternative preferred codons are introduced to each version. Thus, for amino acids having more than two codons, one preferred codon is introduced to one version and another preferred codon is introduced to the other version. For amino acids having six codons, the two codons with the largest number of mismatched bases are identified and one is introduced to one version and the other codon is introduced to the other version. Concurrent with, subsequent to, or prior to selecting codons to be replaced, desired and undesired sequences, such as undesired transcriptional regulatory sequences, in the target sequence are identified. These sequences can be identified using databases and software such as EPD, NNPD, REBASE, TRANSFAC, TESS, GenePro, MAR (www.ncgr.org/MAR-search) and BCM Gene Finder, further described herein. After the sequences are identified, the modification(s) are introduced. Once a desired synthetic nucleic acid sequence is obtained, it can be prepared by methods well known to the art (such as PCR with overlapping primers), and its structural and functional properties compared to the target nucleic acid sequence, including, but not limited to, percent homology, presence or absence of certain sequences, for example, restriction sites, percent of codons changed (such as an increased or decreased usage of certain codons) and expression rates.
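The rule for six-codon amino acids, choosing the two codons with the largest number of mismatched bases, can be sketched as follows. The sketch assumes, for illustration only, that all six serine codons of the standard genetic code are candidates; in practice the candidates would first be restricted by host codon usage, and the function names are hypothetical.

```python
from itertools import combinations


def mismatches(codon_a: str, codon_b: str) -> int:
    """Number of base positions at which two codons differ."""
    return sum(a != b for a, b in zip(codon_a, codon_b))


def most_distinct_pair(codons):
    """Return the pair of codons with the largest number of mismatched bases.

    Ties are resolved in favor of the first pair encountered, so the order
    of the candidate list (e.g., by usage frequency) matters.
    """
    return max(combinations(codons, 2), key=lambda pair: mismatches(*pair))


if __name__ == "__main__":
    serine = ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"]
    print(most_distinct_pair(serine))
```

One codon of the returned pair would go into one version of the gene and the other codon into the second version.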

As described below, the method was used to create synthetic reporter genes encoding Renilla reniformis luciferase, and two click beetle luciferases (one emitting green light and the other emitting red light). For both systems, the synthetic genes support much greater levels of expression than the corresponding native or parent genes for the protein. In addition, the native and parent genes demonstrated anomalous transcription characteristics when expressed in mammalian cells, which were not evident in the synthetic genes. In particular, basal expression of the native or parent genes is relatively high. Furthermore, the expression is induced to very high levels by an enhancer sequence in the absence of known promoters. The synthetic genes show lower basal expression and do not show the anomalous enhancer behavior. Presumably, the enhancer is activating transcriptional elements found in the native genes that are absent in the synthetic genes. The results clearly show that the synthetic nucleic acid sequences exhibit superior performance as reporter genes.

Exemplary Uses of the Molecules of the Invention

The synthetic genes of the invention preferably encode the same proteins as their native counterpart (or nearly so), but have improved codon usage while being largely devoid of known transcription regulatory elements in the coding region. (It is recognized that a small number of amino acid changes may be desired to enhance a property of the native counterpart protein, e.g., to enhance luminescence of a luciferase.) This increases the level of expression of the protein the synthetic gene encodes and reduces the risk of anomalous expression of the protein. For example, studies of many important events of gene regulation, which may be mediated by weak promoters, are limited by insufficient reporter signals from inadequate expression of the reporter proteins. The synthetic luciferase genes described herein permit detection of weak promoter activity because of the large increase in level of expression, which enables increased detection sensitivity. Also, the use of some selectable markers may be limited by the expression of that marker in an exogenous cell. Thus, synthetic selectable marker genes which have improved codon usage for that cell, and have a decrease in other undesirable sequences (e.g., transcription factor binding sites), can permit the use of those markers in cells that otherwise were undesirable as hosts for those markers.

Promoter crosstalk is another concern when a co-reporter gene is used to normalize transfection efficiencies. With the enhanced expression of synthetic genes, the amount of DNA containing strong promoters can be reduced, or DNA containing weaker promoters can be employed, to drive the expression of the co-reporter. In addition, there may be a reduction in the background expression from the synthetic reporter genes of the invention. This characteristic makes synthetic reporter genes more desirable by minimizing the sporadic expression from the genes and reducing the interference resulting from other regulatory pathways.

The use of reporter genes in imaging systems, which can be used for in vivo biological studies or drug screening, is another use for the synthetic genes of the invention. Due to their increased level of expression, the protein encoded by a synthetic gene is more readily detectable by an imaging system. In fact, using a synthetic Renilla luciferase gene, luminescence in transfected CHO cells was detected visually without the aid of instrumentation.

In addition, the synthetic genes may be used to express fusion proteins, for example fusions with secretion leader sequences or cellular localization sequences, to study transcription in difficult-to-transfect cells such as primary cells, and/or to improve the analysis of regulatory pathways and genetic elements. Other uses include, but are not limited to, the detection of rare events that require extreme sensitivity (e.g., studying RNA recoding), use with IRES, to improve the efficiency of in vitro translation or in vitro transcription-translation coupled systems such as TNT (Promega Corp., Madison, Wis.), study of reporters optimized to different host organisms (e.g., plants, fungus, and the like), use of multiple genes as co-reporters to monitor drug toxicity, as reporter molecules in multiwell assays, and as reporter molecules in drug screening with the advantage of minimizing possible interference of reporter signal by different signal transduction pathways and other regulatory mechanisms.

Additionally, uses for the nucleic acid molecules of the invention include fluorescence activated cell sorting (FACS), fluorescent microscopy, to detect and/or measure the level of gene expression in vitro and in vivo (e.g., to determine promoter strength), subcellular localization or targeting (fusion protein), as a marker, in calibration, in a kit (e.g., for dual assays), for in vivo imaging, to analyze regulatory pathways and genetic elements, and in multi-well formats.

With respect to synthetic DNA encoding luciferases, the use of synthetic click beetle luciferases provides advantages such as the measurement of dual reporters. As Renilla luciferase is better suited for in vivo imaging (because it does not depend on ATP or Mg²⁺ for reaction, unlike firefly luciferase, and because coelenterazine is more permeable to the cell membrane than luciferin), the synthetic Renilla luciferase gene can be employed in vivo. Further, the synthetic Renilla luciferase has improved fidelity and sensitivity in dual luciferase assays, e.g., for biological analysis or in drug screening platforms.

Demonstration of the Invention Using Luciferase Genes

The reporter genes for click beetle luciferase and Renilla luciferase were used to demonstrate the invention because the reactions catalyzed by the proteins they encode are significantly easier to quantify than the products of most genes. However, for the purposes of demonstrating the present invention they represent genes in general.

Although the click beetle luciferase and Renilla luciferase genes share the name “luciferase”, this should not be interpreted to mean that they originate from the same family of genes. The two luciferase proteins are evolutionarily distinct; they have fundamentally different traits and physical structures, they use vastly different substrates (FIG. 17), and they evolved from completely different families of genes. The click beetle luciferase is 61 kD in size, uses luciferin as a substrate and evolved from the CoA synthetases. The Renilla luciferase originates from the sea pansy Renilla reniformis, is 35 kD in size, uses coelenterazine as a substrate and evolved from the αβ hydrolases. The only shared trait of these two enzymes is that the reaction they catalyze results in light output. They are no more similar for resulting in light output than any other two enzymes would be, for example, simply because the reaction they catalyze results in heat.

Bioluminescence is the light produced in certain organisms as a result of luciferase-mediated oxidation reactions. The luciferase genes, e.g., the genes from luminous beetles, sea pansy, and, in particular, the luciferase from Photinus pyralis (the common firefly of North America), are currently the most popular luminescent reporter genes. Reference is made to Bronstein et al. (1994) for a review of luminescent reporter gene assays and to Wood (1995) for a review of the evolution of beetle bioluminescence. See FIG. 17 for an illustration of the reactions catalyzed by each of firefly and click beetle luciferases (17A) and Renilla luciferase (17B).

Firefly luciferase and Renilla luciferase are highly valuable as genetic reporters due to the convenience, sensitivity and linear range of the luminescence assay. Today, luciferase is used in virtually every type of experimental biological system, including, but not limited to, prokaryotic and eukaryotic cell culture, transgenic plants and animals, and cell-free expression systems. The firefly luciferase enzyme is derived from a specific North American beetle, Photinus pyralis. The firefly luciferase enzyme and the click beetle luciferase enzyme are monomeric proteins (61 kDa) which generate light through monooxygenation of beetle luciferin utilizing ATP and O₂ (FIG. 17A). The Renilla luciferase is derived from the sea pansy Renilla reniformis. The Renilla luciferase enzyme is a 36 kDa monomeric protein that utilizes O₂ and coelenterazine to generate light (FIG. 17B).

The gene encoding firefly luciferase was cloned from Photinus pyralis, and demonstrated to produce active enzyme in E. coli (de Wet et al., 1987). The cDNA encoding firefly luciferase (luc) continues to gain favor as the gene of choice for reporting genetic activity in animal, plant and microbial cells. The firefly luciferase reaction, modified by the addition of CoA to produce persistent light emission, provides an extremely sensitive and rapid in vitro assay for quantifying firefly luciferase expression in small samples of transfected cells or tissues.

To use firefly luciferase or click beetle luciferase as a genetic reporter, extracts of cells expressing the luciferase are mixed with substrates (beetle luciferin, Mg²⁺ ATP, and O₂), and luminescence is measured immediately. The assay is very rapid and sensitive, providing gene expression data with little effort. The conventional firefly luciferase assay has been further improved by including coenzyme A in the assay reagent to yield greater enzyme turnover and thus greater luminescence intensity (Promega Luciferase Assay Reagent, Cat. # E1500, Promega Corporation, Madison, Wis.). Using this reagent, luciferase activity can be readily measured in luminometers or scintillation counters. Firefly and click beetle luciferase activity can also be detected in living cells in culture by adding luciferin to the growth medium. This in situ luminescence relies on the ability of beetle luciferin to diffuse through cellular and peroxisomal membranes and on the intracellular availability of ATP and O₂ in the cytosol and peroxisome.

Further, although reporter genes are widely used to measure transcription events, their utility can be limited by the fidelity and efficiency of reporter expression. For example, in U.S. Pat. No. 5,670,356, a firefly luciferase gene (referred to as luc+) was modified to improve the level of luciferase expression. While a higher level of expression was observed, it was not determined that higher expression had improved regulatory control.

The invention will be further described by the following nonlimiting examples.

EXAMPLE 1 Synthetic Click Beetle (RD and GR) Luciferase Nucleic Acid Molecules

LucPplYG is a wild-type click beetle luciferase that emits yellow-green luminescence (Wood, 1989). A mutant of LucPplYG named YG#81-6G01 was envisioned. YG#81-6G01 lacks a peroxisome targeting signal, has a lower K_M for luciferin and ATP, and has increased signal stability and increased temperature stability when compared to the wild type (PCT/WO9914336). YG#81-6G01 was mutated to emit green luminescence by changing Ala at position 224 to Val (A224V is a green-shifting mutation), or to emit red luminescence by simultaneously introducing the amino acid substitutions A224H, S247H, N346I, and H348Q (red-shifting mutation set) (PCT/WO9518853).

Using YG#81-6G01 as a parent gene, two synthetic gene sequences were designed. One codes for a luciferase emitting green luminescence (GR) and one for a luciferase emitting red luminescence (RD). Both genes were designed to 1) have optimized codon usage for expression in mammalian cells, 2) have a reduced number of transcriptional regulatory sites including mammalian transcription factor binding sites, splice sites, poly(A) addition sites and promoters, as well as prokaryotic (E. coli) regulatory sites, 3) be devoid of unwanted restriction sites, e.g., those which are likely to interfere with standard cloning procedures, and 4) have a low DNA sequence identity compared to each other in order to minimize genetic rearrangements when both are present inside the same cell. In addition, desired sequences, e.g., a Kozak sequence or restriction enzyme recognition sites, may be identified and introduced.

Not all design criteria could be met equally well at the same time. The following priority was established for reduction of transcriptional regulatory sites: elimination of transcription factor (TF) binding sites received the highest priority, followed by elimination of splice sites and poly(A) addition sites, and finally prokaryotic regulatory sites. When removing regulatory sites, the strategy was to work from the least important to the most important to ensure that the most important changes were made last. Then the sequence was rechecked for the appearance of new lower priority sites and additional changes were made as needed. Thus, the process for designing the synthetic GR and RD gene sequences, using computer programs described herein, involved 5 optionally iterative steps that are detailed below:

1. Optimized codon usage and changed A224V to create GRver1; separately changed A224H, S247H, H348Q and N346I to create RDver1. These particular amino acid changes were maintained throughout all subsequent manipulations to the sequence.
2. Removed undesired restriction sites, prokaryotic regulatory sites, splice sites, and poly(A) sites, thereby creating GRver2 and RDver2.
3. Removed transcription factor binding sites (first pass) and removed any newly created undesired sites as listed in step 2 above, thereby creating GRver3 and RDver3.
4. Removed transcription factor binding sites created by step 3 above (second pass) and removed any newly created undesired sites as listed in step 2 above, thereby creating GRver4 and RDver4.
5. Removed transcription factor binding sites created by step 4 above (third pass) and confirmed absence of sites listed in step 2 above, thereby creating GRver5 and RDver5.
6. Constructed the actual genes by PCR using synthetic oligonucleotides corresponding to fragments of GRver5 and RDver5 designed sequences (FIGS. 6 and 10), thereby creating GR6 and RD7. GR6, upon sequencing, was found to have the serine residue at amino acid position 49 mutated to an asparagine and the proline at amino acid position 230 mutated to a serine (S49N, P230S). RD7, upon sequencing, was found to have the histidine at amino acid position 36 mutated to a tyrosine (H36Y). These changes occurred during the PCR process.
7. The mutations described in step 6 above (S49N, P230S for GR6 and H36Y for RD7) were reversed to create GRver5.1 and RDver5.1.
8. RDver5.1 was further modified by changing the arginine codon at position 351 to a glycine codon (R351G), thereby creating RDver5.2 with improved spectral properties compared to RDver5.1.
9. RDver5.2 was further mutated to increase luminescence intensity, thereby creating RD156-1H9, which encodes four additional amino acid changes (M2I, S349T, K488T, E538V) and three silent single base changes (SEQ ID NO:18).

1. Optimize Codon Usage and Introduce Mutations Determining Luminescence Color

The starting gene sequence for this design step was YG#81-6G01 (SEQ ID NO:2).

a) Optimize Codon Usage:

The strategy was to adapt the codon usage for optimal expression in human cells and at the same time to avoid E. coli low-usage codons. Based on these requirements, the best two codons for expression in human cells for all amino acids with more than two codons were selected (see Wada et al., 1990). In the selection of codon pairs for amino acids with six codons, the selection was biased towards pairs that have the largest number of mismatched bases to allow design of GR and RD genes with minimum sequence identity (codon distinction):

-   Arg: CGC/CGT
-   Leu: CTG/TTG
-   Ser: TCT/AGC
-   Thr: ACC/ACT
-   Pro: CCA/CCT
-   Ala: GCC/GCT
-   Gly: GGC/GGT
-   Val: GTC/GTG
-   Ile: ATC/ATT

Based on this selection of codons, two gene sequences encoding the YG#81-6G01 luciferase protein sequence were computer generated. The two genes were designed to have minimum DNA sequence identity and at the same time closely similar codon usage. To achieve this, each codon in the two genes was replaced by a codon from the limited list described above in an alternating fashion (e.g., Arg(n) is CGC in gene 1 and CGT in gene 2, Arg(n+1) is CGT in gene 1 and CGC in gene 2).
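The alternating codon assignment described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function name `alternate_codons`, the use of one-letter amino acid codes, and the choice to alternate per occurrence of each amino acid are assumptions made here, and only the nine amino acids whose codon pairs are listed above are covered.

```python
# Codon pairs exactly as listed in the text, keyed by one-letter amino acid
# code (the one-letter keys are an assumption made for this sketch).
CODON_PAIRS = {
    "R": ("CGC", "CGT"), "L": ("CTG", "TTG"), "S": ("TCT", "AGC"),
    "T": ("ACC", "ACT"), "P": ("CCA", "CCT"), "A": ("GCC", "GCT"),
    "G": ("GGC", "GGT"), "V": ("GTC", "GTG"), "I": ("ATC", "ATT"),
}


def alternate_codons(protein):
    """Return two coding sequences that alternate between the paired codons.

    Hypothetical helper: the n-th occurrence of an amino acid gets the first
    codon of its pair in gene 1 and the second in gene 2; the next occurrence
    of the same amino acid swaps them.
    """
    seen = {}  # number of times each amino acid has occurred so far
    gene1, gene2 = [], []
    for aa in protein:
        first, second = CODON_PAIRS[aa]  # KeyError for amino acids not listed
        if seen.get(aa, 0) % 2 == 1:
            first, second = second, first
        gene1.append(first)
        gene2.append(second)
        seen[aa] = seen.get(aa, 0) + 1
    return "".join(gene1), "".join(gene2)


gene1, gene2 = alternate_codons("RRLV")
print(gene1)  # CGCCGTCTGGTC
print(gene2)  # CGTCGCTTGGTG
```

Both output sequences encode the same peptide but differ at every codon, which is the property the design step relies on.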

For subsequent steps in the design process it was anticipated that changes had to be made to this limited optimal codon selection in order to meet other design criteria; however, the following low-usage codons in mammalian cells were not used unless needed to meet criteria of higher priority: Arg: CGA; Leu: CTA; Ser: TCG; Pro: CCG; Val: GTA; Ile: ATA.

Also, the following low-usage codons in E. coli were avoided when reasonable (note that 3 of these match the low-usage list for mammalian cells): Arg: CGA/CGG/AGA/AGG; Leu: CTA; Pro: CCC; Ile: ATA.

b) Introduce Mutations Determining Luminescence Color:

The single green-shifting mutation was introduced into one of the two codon-optimized gene sequences, and the 4 red-shifting mutations were introduced into the other, as described above.

The two output sequences from this first design step were named GRver1 (version 1 GR) and RDver1 (version 1 RD). Their DNA sequences are 63% identical (594 mismatches), while the proteins they encode differ only by the 4 amino acids that determine luminescence color (see FIGS. 2 and 3 for an alignment of the DNA and protein sequences).

Tables 1 and 2 show, as an example, the codon usage for valine and leucine in human genes, the parent gene YG#81-6G01, the codon-optimized synthetic genes GRver1 and RDver1, as well as the final versions of the synthetic genes after completion of step 5 in the design process (GRver5 and RDver5). For a complete summary of the codon changes, see FIGS. 4 and 5.

TABLE 1
Valine

Codon   Human   Parent   GR ver1   RD ver1   GR ver5   RD ver5
GTA     4       13       0         0         1         1
GTC     13      4        25        24        21        26
GTG     24      12       25        25        25        17
GTT     9       20       0         0         3         5

TABLE 2
Leucine

Codon   Human   Parent   GR ver1   RD ver1   GR ver5   RD ver5
CTA     3       5        0         0         0         0
CTC     12      4        0         1         12        11
CTG     24      4        28        27        19        18
CTT     6       12       0         0         1         1
TTA     3       17       0         0         0         0
TTG     6       13       27        27        23        25

2. Remove Undesired Restriction Sites, Prokaryotic Regulatory Sites, Splice Sites and Poly(A) Addition Sites

The starting gene sequences for this design step were GRver1 and RDver1.

a) Remove Undesired Restriction Sites:

To check for the presence and location of undesired restriction sites, the sequences of both synthetic genes were compared against a database of restriction enzyme recognition sequences (REBASE ver. 712, http://www.neb.com/rebase) using standard sequence analysis software (GenePro ver 6.10, Riverside Scientific Ent.).

Specifically, the following restriction enzymes were classified as undesired:

-   BamH I, Xho I, Sfi I, Kpn I, Sac I, Mlu I, Nhe I, Sma I, Xho I, Bgl II, Hind III, Nco I, Nar I, Xba I, Hpa I, Sal I,
-   other cloning sites commonly used: EcoR I, EcoR V, Cla I,
-   eight-base cutters (commonly used for complex constructs),
-   BstE II (to allow N-terminal fusions),
-   Xcm I (can generate A/T overhang used for T-vector cloning).

To eliminate undesired restriction sites when found in a synthetic gene, one or more codons of the synthetic gene sequence were altered in accordance with the codon optimization guidelines described in 1a above.

b) Remove Prokaryotic (E. coli) Regulatory Sequences:

To check for the presence and location of prokaryotic regulatory sequences, the sequences of both synthetic genes were searched for the presence of the following consensus sequences using standard sequence analysis software (GenePro):

-   TATAAT (−10 Pribnow box of promoter)
-   AGGA or GGAG (ribosome binding site; only considered if paired with a methionine codon 12 or fewer bases downstream).

To eliminate such regulatory sequences when found in a synthetic gene, one or more codons of the synthetic gene sequence were altered in accordance with the codon optimization guidelines described in 1a above.

c) Remove Splice Sites:

To check for the presence and location of splice sites, the DNA strand corresponding to the primary RNA transcript of each synthetic gene was searched for the presence of the following consensus sequences (see Watson et al., 1983) using standard sequence analysis software (GenePro):

-   splice donor site: AG|GTRAGT (exon|intron); the search was performed for AGGTRAG and the lower stringency GGTRAGT;
-   splice acceptor site: (Y)_(n)NCAG|G (intron|exon); the search was performed with n=1.

To eliminate splice sites found in a synthetic gene, one or more codons of the synthetic gene sequence were altered in accordance with the codon optimization guidelines described in 1a above. Splice acceptor sites were generally difficult to eliminate in one gene without introducing them into the other gene because they tended to contain one of the only two Gln codons (CAG); they were removed by placing the Gln codon CAA in both genes at the expense of a slightly increased sequence identity between the two genes.

d) Remove Poly(A) Addition Sites:

To check for the presence and location of poly(A) addition sites, the sequences of both synthetic genes were searched for the presence of the following consensus sequence using standard sequence analysis software (GenePro):

-   AATAAA.

To eliminate each poly(A) addition site found in a synthetic gene, one or more codons of the synthetic gene sequence were altered in accordance with the codon optimization guidelines described in 1a above. The two output sequences from this second design step were named GRver2 and RDver2. Their DNA sequences are 63% identical (590 mismatches) (FIGS. 2 and 3).

3. Remove Transcription Factor (TF) Binding Sites, then Repeat Steps 2 a-d

The starting gene sequences for this design step were GRver2 and RDver2. To check for the presence, location and identity of potential TF binding sites, the sequences of both synthetic genes were used as query sequences to search a database of transcription factor binding sites (TRANSFAC v3.2). The TRANSFAC database (http://transfac.gbf.de/TRANSFAC/index:html) holds information on gene regulatory DNA sequences (TF binding sites) and proteins (TFs) that bind to and act through them. The SITE table of TRANSFAC Release 3.2 contains 4,401 entries of individual (putative) TF binding sites (including TF binding sites in eukaryotic genes, in artificial sequences resulting from mutagenesis studies and in vitro selection procedures based on random oligonucleotide mixtures or specific theoretical considerations, and consensus binding sequences (from Faisst and Meyer, 1992)).

The software tool used to locate and display these TF binding sites in the synthetic gene sequences was TESS (Transcription Element Search Software, http://agave.humgen.upenn.edu/tess/index.html). The filtered string-based search option was used with the following user-defined search parameters:

-   Factor Selection Attribute: Organism Classification
-   Search Pattern: Mammalia
-   Max. Allowable Mismatch %: 0
-   Min. element length: 5
-   Min. log-likelihood: 10

This parameter selection specifies that only mammalian TF binding sites (approximately 1,400 of the 4,401 entries in the database) that are at least 5 bases long will be included in the search. It further specifies that only TF binding sites that have a perfect match in the query sequence and a minimum log likelihood (LLH) score of 10 will be reported. The LLH scoring method assigns 2 to an unambiguous match, 1 to a partially ambiguous match (e.g., A or T match W) and 0 to a match against ‘N’. For example, a search with the parameters specified above would result in a “hit” (positive result or match) for TATAA (SEQ ID NO:240) (LLH=10), STRATG (SEQ ID NO:241) (LLH=10), and MTTNCNNMA (SEQ ID NO:242) (LLH=10) but not for TRATG (SEQ ID NO:243) (LLH=9) if these four TF binding sites were present in the query sequence. A lower stringency test was performed at the end of the design process to re-evaluate the search parameters.
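The LLH scoring rule stated above (2 for an unambiguous base, 1 for a partially ambiguous code, 0 for N) can be sketched in a few lines of Python. This follows only the description given here and is not the TESS implementation; the function names `llh_score` and `is_reported` are hypothetical.

```python
UNAMBIGUOUS = set("ACGT")
# IUPAC codes that stand for two or three bases (partially ambiguous).
PARTIALLY_AMBIGUOUS = set("RYSWKMBDHV")


def llh_score(site):
    """Score a consensus site: 2 per unambiguous base, 1 per partially
    ambiguous code, 0 per N, as described in the text."""
    score = 0
    for base in site.upper():
        if base in UNAMBIGUOUS:
            score += 2
        elif base in PARTIALLY_AMBIGUOUS:
            score += 1
        elif base != "N":
            raise ValueError("not an IUPAC nucleotide code: " + base)
    return score


def is_reported(site, min_length=5, min_llh=10):
    """Apply the minimum element length and minimum LLH filters."""
    return len(site) >= min_length and llh_score(site) >= min_llh


for site in ("TATAA", "STRATG", "MTTNCNNMA", "TRATG"):
    print(site, llh_score(site), is_reported(site))
```

With the default thresholds, the first three example sites pass the filter and TRATG (score 9) does not, matching the example in the text.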

When TESS was tested with a mock query sequence containing known TF binding sites, it was found that the program was unable to report matches to sites ending with the 3′ end of the query sequence. Thus, an extra nucleotide was added to the 3′ end of all query sequences to eliminate this problem.

The first search for TF binding sites using the parameters described above found about 100 transcription factor binding sites (hits) for each of the two synthetic genes (GRver2 and RDver2). All sites were eliminated by changing one or more codons of the synthetic gene sequences in accordance with the codon optimization guidelines described in 1a above. However, it was expected that some of these changes created new TF binding sites, other regulatory sites, and new restriction sites. Thus, steps 2 a-d were repeated as described, and 4 new restriction sites and 2 new splice sites were removed. The two output sequences from this third design step were named GRver3 and RDver3. Their DNA sequences are 66% identical (541 mismatches) (FIGS. 2 and 3).

4. Remove New Transcription Factor (TF) Binding Sites, then Repeat Steps 2 a-d

The starting gene sequences for this design step were GRver3 and RDver3. This fourth step is an iteration of the process described in step 3. The search for newly introduced TF binding sites yielded about 50 hits for each of the two synthetic genes. All sites were eliminated by changing one or more codons of the synthetic gene sequences in general accordance with the codon optimization guidelines described in 1a above. However, more high to medium usage codons were used to allow elimination of all TF binding sites. The lowest priority was placed on maintaining low sequence identity between the GR and RD genes. Then steps 2 a-d were repeated as described. The two output sequences from this fourth design step were named GRver4 and RDver4. Their DNA sequences are 68% identical (506 mismatches) (FIGS. 2 and 3).

5. Remove New Transcription Factor (TF) Binding Sites, then Repeat Steps 2 a-d

The starting gene sequences for this design step were GRver4 and RDver4. This fifth step is another iteration of the process described in step 3 above. The search for new TF binding sites introduced in step 4 yielded about 20 hits for each of the two synthetic genes. All sites were eliminated by changing one or more codons of the synthetic gene sequences in general accordance with the codon optimization guidelines described in 1a above. However, more high to medium usage codons were used (these are all considered “preferred”) to allow elimination of all TF binding sites. The lowest priority was placed on maintaining low sequence identity between the GR and RD genes. Then steps 2 a-d were repeated as described. Only one acceptor splice site could not be eliminated. As a final step, the absence of all TF binding sites in both genes as specified in step 3 was confirmed. The two output sequences from this fifth and last design step were named GRver5 and RDver5. Their DNA sequences are 69% identical (504 mismatches) (FIGS. 2 and 3).
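The scan, substitute and rescan cycle used in steps 3 to 5 can be sketched as a small loop. This is a toy illustration, not the procedure actually used: the codon table is a deliberately tiny subset of the genetic code, the helper names are hypothetical, and codon preference and sequence identity between the two genes are ignored.

```python
# Tiny subset of the standard genetic code, enough for the toy example.
CODON_TO_AA = {
    "GCC": "A", "GCT": "A", "AAT": "N", "AAC": "N",
    "AAA": "K", "AAG": "K", "GGC": "G", "GGT": "G",
}


def translate(seq):
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, len(seq), 3))


def synonyms(codon):
    aa = CODON_TO_AA[codon]
    return [c for c, a in CODON_TO_AA.items() if a == aa and c != codon]


def find_sites(seq, motifs):
    """Return (motif, position) for every exact occurrence."""
    return [(m, i) for m in motifs
            for i in range(len(seq) - len(m) + 1) if seq.startswith(m, i)]


def fix_one_hit(seq, motifs):
    """Try synonymous changes in the codons overlapping the first hit and
    return the first candidate with fewer sites overall, or None."""
    hits = find_sites(seq, motifs)
    motif, pos = hits[0]
    for k in range(pos // 3, (pos + len(motif) - 1) // 3 + 1):
        for alt in synonyms(seq[3 * k:3 * k + 3]):
            candidate = seq[:3 * k] + alt + seq[3 * k + 3:]
            if len(find_sites(candidate, motifs)) < len(hits):
                return candidate
    return None


def remove_sites(seq, motifs, max_rounds=10):
    """Rescan after every change, since a change may create new sites."""
    for _ in range(max_rounds):
        if not find_sites(seq, motifs):
            break
        candidate = fix_one_hit(seq, motifs)
        if candidate is None:  # a site that cannot be removed is left in place
            break
        seq = candidate
    return seq


cleaned = remove_sites("GCCAATAAAGGC", ["AATAAA"])
print(cleaned, translate(cleaned))  # GCCAACAAAGGC ANKG
```

The encoded peptide is unchanged while the poly(A) consensus is gone. A site that no synonymous change can remove is simply left in place, much as one splice acceptor site remained in GRver5.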

Additional Evaluation of GRver5 and RDver5

a) Use Lower Stringency Parameters for TESS:

The search for TF binding sites was repeated as described in step 3 above, but with even less stringent user-defined parameters:

-   setting LLH to 9 instead of 10 did not result in new hits;
-   setting LLH to 0 through 8 (incl.) resulted in hits for two additional sites, MAMAG (22 hits) and CTKTK (24 hits);
-   setting LLH to 8 and the minimum element length to 4, the search yielded (in addition to the two sites above) different 4-base sites for AP-1, NF-1, and c-Myb that are shortened versions of their longer respective consensus sites which were eliminated in steps 3-5 above.

It was not realistic to attempt complete elimination of these sites without introduction of new sites, so no further changes were made.

b) Search Different Database:

The Eukaryotic Promoter Database (release 45) contains information about reliably mapped transcription start sites (1253 sequences) of eukaryotic genes. This database was searched using BLASTN 1.4.11 with default parameters (optimized to find nearly identical sequences rapidly; see Altschul et al., 1990) at the National Center for Biotechnology Information site (http://www.ncbi.nlm.nih.gov/cgi-bin/BLAST). To test this approach, a portion of pGL3-Control vector sequence containing the SV40 promoter and enhancer was used as a query sequence, yielding the expected hits to SV40 sequences. No hits were found when using the two synthetic genes as query sequences.

Summary of GRver5 and RDver5 Synthetic Gene Properties

Both genes, which at this stage were still only “virtual” sequences in the computer, have a codon usage that strongly favors mammalian high-usage codons and minimizes mammalian and E. coli low-usage codons. FIG. 4 shows a summary of the codon usage of the parent gene and the various synthetic gene versions.

Both genes are also completely devoid of eukaryotic TF binding sites consisting of more than four unambiguous bases, donor and acceptor splice sites (one exception: GRver5 contains one splice acceptor site), poly(A) addition sites, specific prokaryotic (E. coli) regulatory sequences, and undesired restriction sites.
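A check of this kind can be sketched by translating the consensus sequences named in step 2 into regular expressions. The sketch below is an illustration only and does not reproduce GenePro: the dictionary names and the `find_motifs` helper are hypothetical, the splice acceptor pattern is written out for n=1 as YNCAGG, and the ribosome binding site rule (which depends on a downstream methionine codon) is left out.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

# Consensus sequences taken from step 2; the labels are hypothetical.
MOTIFS = {
    "pribnow_box": "TATAAT",
    "poly_a_signal": "AATAAA",
    "splice_donor": "AGGTRAG",
    "splice_donor_low_stringency": "GGTRAGT",
    "splice_acceptor_n1": "YNCAGG",
}


def find_motifs(seq):
    """Return (label, position, matched text) for every occurrence,
    including overlapping ones (hence the lookahead)."""
    seq = seq.upper()
    hits = []
    for label, consensus in MOTIFS.items():
        pattern = "(?=(" + "".join(IUPAC[b] for b in consensus) + "))"
        for match in re.finditer(pattern, seq):
            hits.append((label, match.start(), match.group(1)))
    return hits


print(find_motifs("CCAATAAAGG"))  # [('poly_a_signal', 2, 'AATAAA')]
print(find_motifs("AGGTAAG"))     # [('splice_donor', 0, 'AGGTAAG')]
```

An empty result for a full-length sequence would correspond to the "devoid of" statement above for the motifs covered by this sketch.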

The gene sequence identity between GRver5 and RDver5 is only 69% (504 base mismatches) while their encoded proteins are 99% identical (4 amino acid mismatches), see FIGS. 2 and 3. Their identity with the parent sequence YG#81-6G01 is 74% (GRver5) and 73% (RDver5), see FIG. 2. Their base composition is 49.9% GC (GRver5) and 49.5% GC (RDver5), compared to 40.2% GC for the parent YG#81-6G01.
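The identity and GC figures quoted above are simple ratios, which the following sketch makes explicit. It assumes an ungapped comparison of two equal-length sequences, and it takes the 1,626 bp gene length from the oligonucleotide design section; the function names are hypothetical.

```python
def count_mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def percent_identity(a, b):
    return 100.0 * (len(a) - count_mismatches(a, b)) / len(a)


def gc_percent(seq):
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


# Toy sequences, not the actual genes.
print(percent_identity("ACGT", "ACGA"))  # 75.0
print(gc_percent("ACGT"))                # 50.0

# Identity implied by 504 mismatches over an assumed length of 1,626 bp.
print(round(100.0 * (1626 - 504) / 1626, 1))  # 69.0
```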

Construction of Synthetic Genes

The two synthetic genes were constructed by assembly from synthetic oligonucleotides in a thermocycler followed by PCR amplification of the full-length genes (similar to Stemmer et al. (1995) Gene 164, pp. 49-53). Unintended mutations that interfered with the design goals of the synthetic genes were corrected.

a) Design of Synthetic Oligonucleotides:

The synthetic oligonucleotides were mostly 40mers that collectively code for both complete strands of each designed gene (1,626 bp) plus flanking regions needed for cloning (1,950 bp total for each gene; FIG. 6). The 5′ and 3′ boundaries of all oligonucleotides specifying one strand were generally placed in a manner to give an average offset/overlap of 20 bases relative to the boundaries of the oligonucleotides specifying the opposite strand.
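The staggered layout described above can be sketched as follows. This is a simplified illustration under stated assumptions: every oligonucleotide is cut at a fixed 40-base spacing with a fixed 20-base offset between strands, whereas the actual boundaries were adjusted individually; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def tile_oligos(seq, length=40, offset=20):
    """Split a duplex into top-strand and bottom-strand oligonucleotides
    whose boundaries are staggered by `offset` bases."""
    top = [seq[i:i + length] for i in range(0, len(seq), length)]
    # Bottom-strand boundaries are shifted by `offset` relative to the top.
    cuts = [0] + list(range(offset, len(seq), length)) + [len(seq)]
    bottom = [reverse_complement(seq[s:e]) for s, e in zip(cuts, cuts[1:])]
    return top, bottom


toy = "ACGT" * 30  # 120-base toy sequence, not one of the designed genes
top, bottom = tile_oligos(toy)
print([len(o) for o in top])     # [40, 40, 40]
print([len(o) for o in bottom])  # [20, 40, 40, 20]
```

Because each internal bottom-strand piece spans the junction between two top-strand pieces, annealing the two sets yields a continuous duplex with 20-base overlaps, which is what the thermocycler assembly relies on.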

The ends of the flanking regions of both genes matched the ends of the amplification primers (pRAMtailup: 5′-gtactgagacgacgccagcccaagcttaggcctgagtg, SEQ ID NO:229, and pRAMtaildn: 5′-ggcatgagcgtgaactgactgaactagcggccgccgag, SEQ ID NO:230) to allow cloning of the genes into our E. coli expression vector pRAM (WO99/14336).

A total of 183 oligonucleotides were designed (FIG. 6): fifteen oligonucleotides that collectively encode the upstream and downstream flanking sequences (identical for both genes; SEQ ID NOs: 35-49) and 168 oligonucleotides (4×42) that encode both strands of the two genes (SEQ ID NOs: 50-217).

All 183 oligonucleotides were run through the hairpin analysis of the OLIGO software (OLIGO 4.0 Primer Analysis Software © 1989-1991 by Wojciech Rychlik) to identify potentially detrimental intra-molecular loop formation. The guidelines for evaluating the analysis results were set according to recommendations of Dr. Sims (Sigma-Genosys Custom Gene Synthesis Department): oligos forming hairpins with ΔG<−10 have to be avoided, those forming hairpins with ΔG≦−7 involving the 3′ end of the oligonucleotide should also be avoided, while those with an overall ΔG≧−5 should not pose a problem for this application. The analysis identified 23 oligonucleotides able to form hairpins with a ΔG between −7.1 and −4.9. Of these, 5 had blocked or nearly blocked 3′ ends (0-3 free bases) and were re-designed by removing 1-4 bases at their 3′ end and adding them to the adjacent oligonucleotide.

The 40mer oligonucleotide covering the sequence complementary to the poly(A) tail had a very low complexity 3′ end (13 consecutive T bases). An additional 40mer was designed with a high complexity 3′ end but a consequently reduced overlap with one of its complementary oligonucleotides (11 instead of 20 bases) on the opposite strand.

Even though the oligos were designed for use in a thermocycler-based assembly reaction, they could also be used in a ligation-based protocol for gene construction. In this approach, the oligonucleotides are annealed in a pairwise fashion and the resulting short double-stranded fragments are ligated using the sticky overhangs. However, this would require that all oligonucleotides be phosphorylated.

b) Gene Assembly and Amplification

In a first step, each of the two synthetic genes was assembled in a separate reaction from 98 oligonucleotides. The total volume for each reaction was 50 μl:

-   0.5 μM oligonucleotides (=0.25 pmoles of each oligo)
-   1.0 U Taq DNA polymerase
-   0.02 U Pfu DNA polymerase
-   2 mM MgCl₂
-   0.2 mM dNTPs (each)
-   0.1% gelatin
-   Cycling conditions: (94° C. for 30 seconds, 52° C. for 30 seconds, and 72° C. for 30 seconds)×55 cycles.
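The amount of each oligonucleotide quoted in the list above follows from the total concentration, the reaction volume and the number of oligonucleotides. The short sketch below shows the arithmetic, assuming that 0.5 μM refers to the combined concentration of all 98 oligonucleotides; the function name is hypothetical.

```python
def pmol_per_oligo(total_conc_um, volume_ul, n_oligos):
    """Amount of each oligo in pmol, given the combined concentration.

    1 uM equals 1 pmol per ul, so concentration (uM) times volume (ul)
    gives the total amount in pmol.
    """
    total_pmol = total_conc_um * volume_ul
    return total_pmol / n_oligos


print(round(pmol_per_oligo(0.5, 50, 98), 3))  # 0.255
```

The result, about 0.25 pmol per oligonucleotide, agrees with the value given in parentheses in the reaction list.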

In a second step, each assembled synthetic gene was amplified in a separate reaction. The total volume for each reaction was 50 μl:

-   2.5 μl assembly reaction
-   5.0 U Taq DNA polymerase
-   0.1 U Pfu DNA polymerase
-   1 μM each primer (pRAMtailup, pRAMtaildn)
-   2 mM MgCl₂
-   0.2 mM dNTPs (each)
-   Cycling conditions: (94° C. for 20 seconds, 65° C. for 60 seconds, 72° C. for 3 minutes)×30 cycles.

The assembled and amplified genes were subcloned into the pRAM vector and expressed in E. coli, yielding 1-2% luminescent GR or RD clones. Five GR and five RD clones were isolated and analyzed further. Of the five GR clones, three had the correct insert size, of which one was weakly luminescent and one had an altered restriction pattern. Of the five RD clones, two had the correct size insert with an altered restriction pattern and one of those was weakly luminescent. Overall, the analysis indicated the presence of a large number of mutations in the genes, most likely the result of errors introduced in the assembly and amplification reactions.

c) Corrective Assembly and Amplification

To remove the large number of mutations present in the full-length synthetic genes, we performed an additional assembly and amplification reaction for each gene using the proof-reading DNA polymerase Tli. The assembly reaction contained, in addition to the 98 GR or RD oligonucleotides, a small amount of DNA from the corresponding full-length clones with mutations described above. This allows the oligos to correct mutations present in the templates.

The following assembly reaction was performed for each of the synthetic genes. The total volume for each reaction was 50 μl:

-   0.5 μM oligonucleotides (=0.25 pmoles of each oligo)
-   0.016 pmol plasmid (mix of clones with correct insert size)
-   2.5 U Tli DNA polymerase
-   2 mM MgCl₂
-   0.2 mM dNTPs (each)
-   0.1% gelatin
-   Cycling conditions: 94° C. for 30 seconds, then (94° C. for 30 seconds, 52° C. for 30 seconds, 72° C. for 30 seconds) for 55 cycles, then 72° C. for 5 minutes.

The following amplification reaction was performed on each of the assembly reactions. The total volume for each amplification reaction was 50 μl:

-   1-5 μl of assembly reaction
-   40 pmol each primer (pRAMtailup, pRAMtaildn)
-   2.5 U Tli DNA polymerase
-   2 mM MgCl₂
-   0.2 mM dNTPs (each)
-   Cycling conditions: 94° C. for 30 seconds, then (94° C. for 20 seconds, 65° C. for 60 seconds and 72° C. for 3 minutes) for 30 cycles, then 72° C. for 5 minutes.

The genes obtained from the corrective assembly and amplification step were subcloned into the pRAM vector and expressed in E. coli, yielding 75% luminescent GR or RD clones. Forty-four GR and 44 RD clones were analyzed with our screening robot (WO99/14336). The six best GR and RD clones were manually analyzed and one best GR and RD clone was selected (GR6 and RD7). Sequence analysis of GR6 revealed two point mutations in the coding region, both of which resulted in an amino acid substitution (S49N and P230S). Sequence analysis of RD7 revealed three point mutations in the coding region, one of which resulted in an amino acid substitution (H36Y). It was confirmed that none of the silent point mutations introduced any regulatory or restriction sites conflicting with the overall design criteria for the synthetic genes.

d) Reversal of Unintended Amino Acid Substitutions

The unintended amino acid substitutions present in the GR6 and RD7 synthetic genes were reversed by site-directed mutagenesis to match the GRver5 and RDver5 designed sequences, thereby creating GRver5.1 and RDver5.1. The DNA sequences of the mutated regions were confirmed by sequence analysis.

e) Improve Spectral Properties

The RDver5.1 gene was further modified to improve its spectral properties by introducing an amino acid change (R351G), thereby creating RDver5.2.

pGL3 Vectors with RD and GR Genes

The parent click beetle luciferase YG#81-6G01 (“YG”), and the synthetic click beetle luciferase genes GRver5.1 (“GR”), RDver5.2 (“RD”), and RD156-1H9 were cloned into the four pGL3 reporter vectors (Promega Corp.):

-   pGL3-Basic=no promoter, no enhancer
-   pGL3-Control=SV40 promoter, SV40 enhancer
-   pGL3-Enhancer=SV40 enhancer (3′ to luciferase coding sequences)
-   pGL3-Promoter=SV40 promoter.

The primers employed in the assembly of GR and RD synthetic genes facilitated the cloning of those genes into pRAM vectors. To introduce the genes into pGL3 vectors (Promega Corp., Madison, Wis.) for analysis in mammalian cells, each gene in a pRAM vector (pRAM RDver5.1, pRAM GRver5.1, and pRAM RD156-1H9) was amplified to introduce an Nco I site at the 5′ end and an Xba I site at the 3′ end of the gene. The primers for pRAM RDver5.1 and pRAM GRver5.1 were:

(SEQ ID NO:231) GR→5′ GGA TCC CAT GGT GAA GCG TGA GAA 3′ or

(SEQ ID NO:232) RD→5′ GGA TCC CAT GGT GAA ACG CGA 3′ and

(SEQ ID NO:233) 5′ CTA GCT TTT TTT TCT AGA TAA TCA TGA AGA C 3′

The primers for pRAM RD156-1H9 were:

(SEQ ID NO:295) 5′ GCG TAG CCA TGG TAA AGC GTG AGA AAA ATG TC 3′ and

(SEQ ID NO:296) 5′ CCG ACT CTA GAT TAC TAA CCG CCG GCC TTC ACC 3′

The PCR included:

-   100 ng DNA plasmid
-   1 μM primer upstream
-   1 μM primer downstream
-   0.2 mM dNTPs
-   1× buffer (Promega Corp.)
-   5 units Pfu DNA polymerase (Promega Corp.)
-   Sterile nanopure H₂O to 50 μl

The cycling parameters were: 94° C. for 5 minutes; (94° C. for 30 seconds; 55° C. for 1 minute; and 72° C. for 3 minutes)×15 cycles. The purified PCR product was digested with Nco I and Xba I, ligated with pGL3-control that was also digested with Nco I and Xba I, and the ligated products introduced to E. coli. To insert the luciferase genes into the other pGL3 reporter vectors (basic, promoter and enhancer), the pGL3-control vectors containing each of the luciferase genes were digested with Nco I and Xba I, ligated with other pGL3 vectors that also were digested with Nco I and Xba I, and the ligated products introduced to E. coli. Note that the polypeptide encoded by GRver5.1 and RDver5.1 (and RD156-1H9, see below) nucleic acid sequences in pGL3 vectors has an amino acid substitution at position 2 to valine as a result of the Nco I site at the initiation codon in the oligonucleotide.
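The link between the Nco I site and the valine at position 2 can be checked directly on the primer sequences. The sketch below uses the primers given for SEQ ID NOs: 231-233, 295 and 296 together with the standard recognition sequences CCATGG (Nco I) and TCTAGA (Xba I); the helper name `second_codon` is hypothetical.

```python
NCO_I = "CCATGG"  # contains the ATG initiation codon followed by G
XBA_I = "TCTAGA"

# Primer sequences as given in the text, with the triplet spacing removed.
PRIMERS = {
    "SEQ ID NO:231": "GGA TCC CAT GGT GAA GCG TGA GAA",
    "SEQ ID NO:232": "GGA TCC CAT GGT GAA ACG CGA",
    "SEQ ID NO:233": "CTA GCT TTT TTT TCT AGA TAA TCA TGA AGA C",
    "SEQ ID NO:295": "GCG TAG CCA TGG TAA AGC GTG AGA AAA ATG TC",
    "SEQ ID NO:296": "CCG ACT CTA GAT TAC TAA CCG CCG GCC TTC ACC",
}
PRIMERS = {name: seq.replace(" ", "") for name, seq in PRIMERS.items()}


def second_codon(primer):
    """Return the codon that follows the ATG inside the Nco I site,
    or None if the primer has no Nco I site."""
    site = primer.find(NCO_I)
    if site < 0:
        return None
    atg = site + 2  # CCATGG: the ATG starts at the third base of the site
    return primer[atg + 3:atg + 6]


for name, seq in PRIMERS.items():
    print(name, NCO_I in seq, XBA_I in seq, second_codon(seq))
```

In each upstream primer the codon after ATG is GTG or GTA, both of which encode valine, because the final G of the Nco I site fixes the first base of codon 2. The downstream primers carry the Xba I site instead.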

Because of internal Nco I and Xba I sites, the native gene in YG#81-6G01 was amplified from a Hind III site upstream to a Hpa I site downstream of the coding region, and the amplified fragment included flanking sequences found in the GR and RD clones. The upstream primer (5′-CAA AAA GCT TGG CAT TCC GGT ACT GTT GGT AAA GCC ACC ATG GTG AAG CGA GAG-3′; SEQ ID NO:234) and a downstream primer (5′-CAA TTG TTG TTG TTA ACT TGT TTA TT-3′; SEQ ID NO:235) were mixed with YG#81-6G01 and amplified using the PCR conditions above. The purified PCR product was digested with Hind III and Hpa I, ligated with pGL3-control that was also digested with Hind III and Hpa I, and the ligated products introduced into E. coli. To insert YG#81-6G01 into the other pGL3 reporter vectors (basic, promoter and enhancer), the pGL3-control vectors containing YG#81-6G01 were digested with Nco I and Xba I, ligated with the other pGL3 vectors that also were digested with Nco I and Xba I, and the ligated products introduced to E. coli. Note that the clone of YG#81-6G01 in the pGL3 vectors has a C instead of an A at base 786, which yields a change in the amino acid sequence at residue 262 from Phe to Leu (FIG. 2 shows the sequence of YG#81-6G01 prior to introduction into pGL3 vectors). To determine whether the altered amino acid at position 262 affected the enzyme biochemistry, the clone of YG#81-6G01 was mutated to resemble the original sequence. Both clones were then tested for expression in E. coli, physical stability, substrate binding, and luminescence output kinetics. No significant differences were found.

Partially purified enzymes expressed from the synthetic genes and the parent gene were employed to determine Km for luciferin and ATP (see Table 3).

TABLE 3

Enzyme      Km (LH₂)    Km (ATP)
YG parent   2 μM        17 μM
GR          1.3 μM      25 μM
RD          24.5 μM     46 μM

In vitro eukaryotic transcription/translation reactions were also conducted using Promega's TNT T7 Quick system according to the manufacturer's instructions. Luminescence levels were 1 to 37-fold and 1 to 77-fold higher (depending on the reaction time) for the synthetic GR and RD genes, respectively, compared to the parent gene (corrected for luminometer spectral sensitivity).

To test whether the synthetic click beetle luciferase genes have improved expression in mammalian cells relative to the wild type click beetle gene, each of the synthetic genes and the parent gene was cloned into a series of pGL3 vectors and introduced into CHO cells (Table 8). In all cases, the synthetic click beetle genes exhibited a higher expression than the native gene. Specifically, expression of the synthetic GR and RD genes was 1900-fold and 40-fold higher, respectively, than that of the parent (transfection efficiency normalized by comparison to native Renilla luciferase gene). Moreover, the data (basic versus control vector) show that the synthetic genes have reduced basal level transcription.

Further, in experiments with the enhancer vector where the percentage of activity in reference to the control is compared between the native and synthetic gene, the data showed that the synthetic genes have reduced risk of anomalous transcription characteristics. In particular, the parent gene appeared to contain one or more internal transcriptional regulatory sequences that are activated by the enhancer in the vector, and thus is not suitable as a reporter gene, while the synthetic GR and RD genes showed a clean reporter response (transfection efficiency normalized by comparison to native Renilla luciferase gene). See Table 9.

The clone names and their corresponding SEQ ID numbers for nucleotide sequence and amino acid sequence are listed below in Table 4.

TABLE 4

Clone name    Luciferase Type                SEQ ID NO. (nucleotide)   SEQ ID NO. (amino acid)
LUCPPLYG      Wild type YG Click Beetle      1                         23
YG#81-6G01    Mutant YG Click Beetle         2                         24
GRver1        Synthetic Green Click Beetle   3                         25
GRver2        Synthetic Green Click Beetle   4                         26
GRver3        Synthetic Green Click Beetle   5                         27
GRver4        Synthetic Green Click Beetle   6                         28
GRver5        Synthetic Green Click Beetle   7                         29
GR6           Synthetic Green Click Beetle   8                         30
GRver5.1      Synthetic Green Click Beetle   9                         31
RDver1        Synthetic Red Click Beetle     10                        32
RDver2        Synthetic Red Click Beetle     11                        33
RDver3        Synthetic Red Click Beetle     12                        34
RDver4        Synthetic Red Click Beetle     13                        218
RDver5        Synthetic Red Click Beetle     14                        219
RD7           Synthetic Red Click Beetle     15                        220
RDver5.1      Synthetic Red Click Beetle     16                        221
RDver5.2      Synthetic Red Click Beetle     17                        222
RD156-1H9     Synthetic Red Click Beetle     18                        223
RELLUC        Wild type Renilla              19                        224
Rlucver1      Synthetic Renilla              20                        225
Rlucver2      Synthetic Renilla              21                        226
Rluc-final    Synthetic Renilla              22                        227

EXAMPLE 2 Evolution of the RD Luciferase Gene

RDver5.2 was mutated to increase its luminescence intensity, thereby creating RD156-1H9, which carries four additional amino acid changes (M2I, S349T, K488T, E538V) and three silent point mutations (SEQ ID NO:18).

a) Site-Directed Mutagenesis:

The initial strategy was to use site-directed mutagenesis. There are four amino acid differences between the GR and RD synthetic genes, with H348Q providing the greatest contribution to red color. Thus, this substitution may also cause structural changes in the protein that could lead to low light output. Optimization of positions near this area could increase light output. The following positions were selected for mutagenesis:

1. S344 (at the edge of the binding pocket for luciferin): randomize this codon.
2. A245 (strictly conserved but closest to 348 and at the edge of the active site pocket): randomize this codon.
3. I347 (not conserved, next to 348 in sequence): mutate to hydrophobic amino acids only.
4. S349 (not conserved, next to 348 in sequence): mutate to S, T, A, P only.

Oligonucleotides designed to mutate the above positions were used in a site-directed mutagenesis experiment (WO99/14336) and the resulting mutants were screened for luminescence intensity. There was little variation in light intensity and only about 25% were luminescent. For more detailed analysis, clones were picked and analyzed with the screening robot (PCT/WO9914336). None of the clones had a luminescence intensity (LI) higher than RDver5.2, but four of the clones had a slightly lower composite Km for luciferin and ATP (Km).

b) Directed Evolution:

Protocols and procedures used for the directed evolution are detailed in PCT/WO9914336. DNA from the four clones with lower Km was combined and three libraries of random mutants were produced. The libraries were screened with the robot and clones with the highest LI values were selected. These clones were shuffled together and another robotic screen was completed with an incubation temperature of 46° C. The three clones with the highest LI values were RD156-0B4, RD156-1A5, and RD156-1H9.

c) Analysis:

The three clones with the highest LI values were selected for manual analysis to confirm that their luminescence intensity was higher than that of RDver5.2 and to ensure that their spectral properties were not compromised. One of the clones was slightly green-shifted; all others maintained the spectral properties of RDver5.2 (Table 5).

TABLE 5

Clone                Peak (nm)   Width (nm)
RD156-0B4            616         68
RD156-1A5            614         70
RD156-1H9            618         69
RDver5.2 (prep #1)   617         70
RDver5.2 (prep #2)   618         69

The Km values for luciferin and the luminescence intensity relative to RDver5.2 were determined for all three clones in several independent experiments. All cell samples were processed with CCLR lysis buffer (E1483, Promega Corp., Madison, Wis.) and diluted 1:10 into buffer (25 mM HEPES pH 7.8, 5% glycerol, 1 mg/ml BSA, 150 mM NaCl). Table 6 summarizes the results from expression in bacterial cells (Lum: luminescence values were normalized to optical density; measurements for independent experiments are separated by forward slashes). RD156-1H9, the clone with the highest luminescence intensity (5 to 10-fold increase), also has an about 2-fold higher Km for luciferin.

TABLE 6
Clone                  Km Luciferin [μM]   Lum (normalized to RDver5.2)
RD156-0B4              8/10                2.2/2.5
RD156-1A5              13/13               3.1/5.6
RD156-1H9              20/23/23            4/10.9/7.5
RDver5.2 (prep #1)     12/14/14
RDver5.2 (prep #2)     40/50
GRver5.1 (prep #1)     0.5                 64
GRver5.1 (prep #2)     3

Table 7 shows a comparison between the luminescence intensities of RD156-1H9, GRver5.1 and RDver5.2 normalized to GRver5.1 with and without correction for the spectral sensitivity of the luminometer photomultiplier tube. With correction, the luminescence intensity of clone RD156-1H9 was only about 2-fold lower than that of GRver5.1. The luciferin Km for clone RD156-1H9 is approximately 40-fold higher than that of GRver5.1. RD156-1H9 is thermostable at 50° C. for at least 2 hours.

TABLE 7
Name          No Correction   With Correction
RDver5.2      0.016           0.06
GRver5.1      1.000           1.00
RD156-1H9     0.116           0.45

Tables 8 and 9 show a comparison of luciferase expression levels in CHO cells. Table 8 shows the expression levels only from the control vectors in comparison to the firefly luciferase gene (RLU = relative light units). Table 9 shows a comparison of the expression levels in all four pGL3 vectors calculated as a percent of the expression level in pGL3-control.

TABLE 8 Synthetic Click Beetle Gene Expression
Control vector     RLU
YG#81-6G01         177
GRver5.1           343,417
RDver5.1           7,161
RD156-1H9          20,802
Firefly            488,016

TABLE 9 Synthetic Click Beetle Gene Expression
Vector                 Percent of control vector
YG-control             100
RD-control             100
GR-control             100
RD156-1H9 control      100
YG-basic               3.3
RD-basic               1.0
GR-basic               0.2
RD156-1H9 basic        0.3
YG-promoter            4.2
RD-promoter            15.1
GR-promoter            5.7
RD156-1H9 promoter     15.5
YG-enhancer            51.5
RD-enhancer            2.8
GR-enhancer            1.4
RD156-1H9 enhancer     0.3

EXAMPLE 3 Synthetic Renilla Luciferase Nucleic Acid Molecule

The synthetic Renilla luciferase genes prepared include 1) an introduced Kozak sequence, 2) codon usage optimized for mammalian (human) expression, 3) a reduction or elimination of unwanted restriction sites, 4) removal of prokaryotic regulatory sites (ribosome binding site and TATA box), 5) removal of splice sites and poly(A) addition sites, and 6) a reduction or elimination of mammalian transcriptional factor binding sequences.

The process of computer-assisted design of synthetic Renilla luciferase genes by iterative rounds of codon optimization and removal of transcription factor binding sites and other regulatory sites as well as restriction sites can be described in the following steps:

-   1. Using the wild type Renilla luciferase gene as the parent gene, codon usage was optimized, one amino acid was changed (T→A) to generate a Kozak consensus sequence, and undesired restriction sites were eliminated, thereby creating synthetic gene Rlucver1.
-   2. Remove prokaryotic regulatory sites, splice sites, poly(A) sites and transcription factor (TF) binding sites (first pass). Then remove newly created TF binding sites. Then remove newly created undesired restriction enzyme sites, prokaryotic regulatory sites, splice sites, and poly(A) sites without introducing new TF binding sites. This thereby created Rlucver2.
-   3. Change 3 bases of Rlucver2, thereby creating Rluc-final.
-   4. The actual gene was then constructed from synthetic oligonucleotides corresponding to the Rluc-final designed sequence. All mutations resulting from the assembly or PCR process were corrected. This gene is Rluc-final (SEQ ID NO:22) and encodes the amino acid sequence of SEQ ID NO:227.

Codon Selection

Starting with the Renilla reniformis luciferase sequence in Genbank (Accession No. M63501, SEQ ID NO: 19), codons were selected based on codon usage for optimal expression in human cells and to avoid E. coli low-usage codons. The best codon for expression in human cells (or the best two codons if found at a similar frequency) was chosen for all amino acids with more than one codon (Wada et al., 1990): Arg: CGC; Lys: AAG; Leu: CTG; Asn: AAC; Ser: TCT/AGC; Gln: CAG; Thr: ACC; His: CAC; Pro: CCA/CCT; Glu: GAG; Ala: GCC; Asp: GAC; Gly: GGC; Tyr: TAC; Val: GTG; Cys: TGC; Ile: ATC/ATT; Phe: TTC.

In cases where two codons were selected for one amino acid, they were used in an alternating fashion. To meet other criteria for the synthetic gene, the initial optimal codon selection was modified to some extent later. For example, introduction of a Kozak sequence required the use of GCT for Ala at amino acid position 2 (see below).
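The codon selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the software actually used: the function name `back_translate` is hypothetical, the Met and Trp entries are added because those amino acids have a single codon, and the later modifications (Kozak sequence, site removal) are not modeled.

```python
from itertools import cycle

# Preferred human codons as listed in the text. Amino acids with two
# entries are used in an alternating fashion. Met and Trp (single-codon
# amino acids) are added here as an assumption, for completeness.
PREFERRED = {
    "R": ["CGC"], "K": ["AAG"], "L": ["CTG"], "N": ["AAC"],
    "S": ["TCT", "AGC"], "Q": ["CAG"], "T": ["ACC"], "H": ["CAC"],
    "P": ["CCA", "CCT"], "E": ["GAG"], "A": ["GCC"], "D": ["GAC"],
    "G": ["GGC"], "Y": ["TAC"], "V": ["GTG"], "C": ["TGC"],
    "I": ["ATC", "ATT"], "F": ["TTC"], "M": ["ATG"], "W": ["TGG"],
}


def back_translate(protein: str) -> str:
    """Encode a protein with the preferred codons, alternating between
    the two choices wherever two codons were selected."""
    # One independent cycling iterator per amino acid, fresh for each call.
    pickers = {aa: cycle(codons) for aa, codons in PREFERRED.items()}
    return "".join(next(pickers[aa]) for aa in protein)


if __name__ == "__main__":
    # Hypothetical peptide, chosen only to show the alternation.
    print(back_translate("MSSPPII"))
```

In this sketch the two Ser residues receive TCT and then AGC, and likewise for Pro and Ile, which mirrors the alternating use described in the text.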

The following low-usage codons in mammalian cells were not used unless needed: Arg: CGA, CGT; Leu: CTA, TTA; Ser: TCG; Pro: CCG; Val: GTA; and Ile: ATA. The following low-usage codons in E. coli were also avoided when reasonable (note that 3 of these match the low-usage list for mammalian cells): Arg: CGA, CGG, AGA, AGG; Leu: CTA; Pro: CCC; and Ile: ATA.

Introduction of Kozak Sequences

The Kozak sequence 5′ aaccATGGCT 3′ (SEQ ID NO: 293) (the Nco I site is ccATGG; the coding region is shown in capital letters) was introduced to the synthetic Renilla luciferase gene. The introduction of the Kozak sequence changes the second amino acid from Thr to Ala (GCT).

Removal of Undesired Restriction Sites

REBASE ver. 808 (updated Aug. 1, 1998; Restriction Enzyme Database; www.neb.com/rebase) was employed to identify undesirable restriction sites as described in Example 1. The following undesired restriction sites (in addition to those described in Example 1) were removed according to the process described in Example 1: EcoICR I, NdeI, NsiI, SphI, SpeI, XmaI, PstI.

The version of Renilla luciferase (Rluc) which incorporates all these changes is Rlucver1.

Removal of Prokaryotic (E. coli) Regulatory Sequences, Splice Sites, and Poly(A) Sites

The priority and process for eliminating transcription regulation sites was as described in Example 1.

Removal of TF Binding Sites

The same process, tools, and criteria were used as described in Example 1; however, the newer version 3.3 of the TRANSFAC database was employed.

After removing prokaryotic regulatory sequences, splice sites and poly(A) sites from Rlucver1, the first search for TF binding sites identified about 60 hits. All sites were eliminated with the exception of three that could not be removed without altering the amino acid sequence of the synthetic Renilla gene:

-   1. site at position 63 composed of two codons for W (TGGTGG), for CAC-binding protein T00076;
-   2. site at position 522 composed of codons for KMV (AAN ATG GTN), for myc-DF1 T00517;
-   3. site at position 885 composed of codons for EMG (GAR ATG GGN), for myc-DF1 T00517.

The subsequent second search for (newly introduced) TF binding sites yielded about 20 hits. All new sites were eliminated, leaving only the three sites described above. Finally, any newly introduced restriction sites, prokaryotic regulatory sequences, splice sites and poly(A) sites were removed without introducing new TF binding sites if possible.
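Why a site such as TGGTGG cannot be removed without changing the protein can be illustrated by enumerating every synonymous encoding of the affected residues. The following Python sketch is an assumption-laden illustration rather than the TESS/TRANSFAC procedure itself: the helper `motif_free_encodings` is hypothetical, it uses the standard genetic code, and it considers only the codons of the given peptide, not the flanking sequence.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Map each amino acid to the list of codons that encode it.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def motif_free_encodings(peptide, motif):
    """Return every synonymous DNA encoding of `peptide` that lacks `motif`."""
    options = [SYNONYMS[aa] for aa in peptide]
    encodings = ("".join(codons) for codons in product(*options))
    return [dna for dna in encodings if motif not in dna]


if __name__ == "__main__":
    # Trp has a single codon, so no synonymous change can remove TGGTGG.
    print(motif_free_encodings("WW", "TGGTGG"))
    # Hypothetical motif, for contrast: Lys has a second codon (AAA).
    print(motif_free_encodings("KM", "AAGATG"))
```

An empty result means that the motif is present in every encoding of the peptide, which is the situation described for the two adjacent Trp codons at position 63.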

Rlucver2 was obtained (SEQ ID Nos. 21 and 226).

As in Example 1, lower stringency search parameters were specified for the TESS filtered string search to further evaluate the synthetic Renilla gene.

With the LLH reduced from 10 to 9 and the minimum element length reduced from 5 to 4, the TESS filtered string search did not show any new hits. When, in addition to the parameter changes listed above, the organism classification was expanded from “mammalia” to “chordata”, the search yielded only four more TF binding sites. When the Min LLH was further reduced to between 8 and 0, the search showed two additional 5-base sites (MAMAG and CTKTK) which combined had four matches in Rlucver2, as well as several 4-base sites. Also as in Example 1, Rlucver2 was checked for hits to entries in the EPD (Eukaryotic Promoter Database, Release 45). Three hits were determined: one to Mus musculus promoter H-2L^d (Cell, 44, 261 (1986)), one to Herpes Simplex Virus type 1 promoter b′g′2.7 kb, and one to Homo sapiens DHFR promoter (J. Mol. Biol., 176, 169 (1984)). However, no further changes were made to Rlucver2.

Summary of Properties for Rlucver2

-   All 30 low usage codons were eliminated. The introduction of a Kozak sequence changed the second amino acid from Thr to Ala;
-   base composition: 55.7% GC (Renilla wild-type parent gene: 36.5%);
-   one undesired restriction site could not be eliminated: EcoR V at position 488;
-   the synthetic gene had no prokaryotic promoter sequence but one potentially functional ribosome binding site (RBS) at positions 867-73 (about 13 bases upstream of a Met codon) could not be eliminated;
-   all poly(A) addition sites were eliminated;
-   splice sites: 2 donor splice sites could not be eliminated (both share the amino acid sequence MGK);
-   TF sites: all sites with a consensus of >4 unambiguous bases were eliminated (about 280 TF binding sites were removed) with 3 exceptions due to the preference to avoid changes to the amino acid sequence.

Synthetic Renilla luciferase sequences are shown in FIGS. 7 and 8. A codon usage comparison is shown in FIG. 9.

When introduced into pGL3, Rluc-final has a Kozak sequence (CACCATGGCT). The changes in Rluc-final relative to Rlucver2 were introduced during gene assembly. One change was at position 619, a C to an A, which eliminated a eukaryotic promoter sequence and reduced the stability of a hairpin structure in the corresponding oligonucleotide employed to assemble the gene. Other changes included a change from CGC to AGA at positions 218-220 (resulted in a better oligonucleotide for PCR).

Gene Assembly Strategy

The gene assembly protocol employed for the synthetic Renilla luciferase was similar to that described in Example 1. The oligonucleotides employed are shown in FIG. 10.

Sense Strand primer (SEQ ID NO:236):
5′ AACCATGGCTTCCAAGGTGTACGACCCCGAGCAACGCAAA 3′

Anti-sense Strand primer (SEQ ID NO:237):
5′ GCTCTAGAATTACTGCTCGTTCTTCAGCACGCGCTCCACG 3′
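As a quick consistency check, the two primers can be scanned for the cloning sites used in the next step. The Python sketch below is illustrative only. The recognition sequences CCATGG (Nco I) and TCTAGA (Xba I) are the standard ones, while the helper names `site_positions` and `reverse_complement` are assumptions introduced here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

SENSE = "AACCATGGCTTCCAAGGTGTACGACCCCGAGCAACGCAAA"      # SEQ ID NO:236
ANTISENSE = "GCTCTAGAATTACTGCTCGTTCTTCAGCACGCGCTCCACG"  # SEQ ID NO:237
SITES = {"NcoI": "CCATGG", "XbaI": "TCTAGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def site_positions(seq: str, site: str) -> list:
    """Return the 0-based start positions of `site` within `seq`."""
    return [i for i in range(len(seq) - len(site) + 1)
            if seq.startswith(site, i)]


if __name__ == "__main__":
    print("NcoI in sense primer:", site_positions(SENSE, SITES["NcoI"]))
    print("XbaI in anti-sense primer:", site_positions(ANTISENSE, SITES["XbaI"]))
    # The anti-sense primer read on the coding strand.
    print(reverse_complement(ANTISENSE))
```

Each primer carries exactly one copy of its site near the 5′ end. Read on the coding strand, the anti-sense primer places a TAA triplet immediately upstream of the Xba I site, which may correspond to the stop codon.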

The resulting synthetic gene fragment was cloned into a pRAM vector using Nco I and Xba I. Two clones having the correct size insert were sequenced. Four to six mutations were found in the synthetic gene from each clone. These mutations were fixed by site-directed mutagenesis (Gene Editor from Promega Corp., Madison, Wis.) and swapping the correct regions between these two genes. The corrected gene was confirmed by sequencing.

Other Vectors

To prepare an expression vector for the synthetic Renilla luciferase gene in a pGL3-control vector backbone, 5 μg of pGL3-control was digested with Nco I and Xba I in 50 μl final volume with 2 μl of each enzyme and 5 μl 10× buffer B (nanopure water was used to fill the volume to 50 μl). The digestion reaction was incubated at 37° C. for 2 hours, and the whole mixture was run on a 1% agarose gel in 1× TAE. The desired vector backbone fragment was purified using Qiagen's QIAquick gel extraction kit.

The native Renilla luciferase gene fragment was cloned into pGL3-control vector using two oligonucleotides, Nco I-RL-F and Xba I-RL-R, to PCR amplify the native Renilla luciferase gene using pRL-CMV as the template. The sequence for Nco I-RL-F is 5′-CGCTAGCCATGGCTTCGAAAGTTTATGATCC-3′ (SEQ ID NO:238); the sequence for Xba I-RL-R is 5′-GGCCAGTAACTCTAGAATTATTGTT-3′ (SEQ ID NO:239). The PCR reaction was carried out as follows:

Reaction mixture (for 100 μl):
DNA template (Plasmid)     1.0 μl   (1.0 ng/μl final)
10× Rec. Buffer           10.0 μl   (Stratagene Corp.)
dNTPs (25 mM each)         1.0 μl   (final 250 μM)
Primer 1 (10 μM)           2.0 μl   (0.2 μM final)
Primer 2 (10 μM)           2.0 μl   (0.2 μM final)
Pfu DNA Polymerase         2.0 μl   (2.5 U/μl, Stratagene Corp.)
double distilled water    82.0 μl
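The final concentrations stated for the reaction mixture follow from simple dilution arithmetic (stock concentration × volume added / total volume). The following Python sketch re-derives them from the listed volumes. The function name `final_concentration` is illustrative, and concentrations are expressed in μM purely as a convention of this example.

```python
# Volumes (µl) of the seven components, in the order listed in the text:
# template, 10x buffer, dNTPs, primer 1, primer 2, Pfu polymerase, water.
VOLUMES_UL = [1.0, 10.0, 1.0, 2.0, 2.0, 2.0, 82.0]
TOTAL_UL = sum(VOLUMES_UL)

# Stock concentration (µM) and volume added (µl) for selected components.
STOCKS = {
    "dNTPs (each)": (25_000.0, 1.0),  # 25 mM stock
    "Primer 1": (10.0, 2.0),
    "Primer 2": (10.0, 2.0),
}


def final_concentration(stock, volume_ul, total_ul=TOTAL_UL):
    """Dilution rule: C_final = C_stock * V_added / V_total."""
    return stock * volume_ul / total_ul


if __name__ == "__main__":
    print(f"total volume: {TOTAL_UL} µl")
    for name, (stock, volume) in STOCKS.items():
        print(f"{name}: {final_concentration(stock, volume):g} µM final")
    # Enzyme amount: 2.0 µl of a 2.5 U/µl stock.
    print(f"Pfu polymerase: {2.0 * 2.5:g} U per reaction")
```

The listed volumes add up to 100 μl, and the computed values agree with the 250 μM dNTP and 0.2 μM primer concentrations given in the mixture.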

-   PCR Reaction: heat 94° C. for 2 minutes; (94° C. for 20 seconds;    65° C. for 1 minute; 72° C. for 2 minutes; then 72° C. for 5    minutes)×25 cycles, then incubate on ice. The PCR amplified fragment    was cut from a gel, and the DNA purified and stored at −20° C.

To introduce the native Renilla luciferase gene fragment into pGL3-control vector, 5 μg of the PCR product of the native Renilla luciferase gene (RAM-RL-synthetic) was digested with Nco I and Xba I. The desired Renilla luciferase gene fragment was purified and stored at −20° C.

Then 100 ng of insert and 100 ng of pGL3-control vector backbone were digested with restriction enzymes Nco I and Xba I and ligated together. Then 2 μl of the ligation mixture was transformed into JM109 competent cells. Eight ampicillin-resistant clones were picked and their DNA isolated. DNA from each positive clone of pGL3-control-native and pGL3-control-synthetic was purified. The correct sequences for the native gene and the synthetic gene in the vectors were confirmed by DNA sequencing.

To determine whether the synthetic Renilla luciferase gene has improved expression in mammalian cells, the gene was cloned into the mammalian expression vector pGL3-control under the control of the SV40 promoter and SV40 early enhancer (FIG. 13A). The native Renilla luciferase gene was also cloned into the pGL3-control vector so that the expression from the synthetic gene and the native gene could be compared. The expression vectors were then transfected into four common mammalian cell lines (CHO, NIH3T3, HeLa and CV-1; Table 10), and the expression levels compared between the vectors with the synthetic gene versus the native gene. Two different amounts of DNA were used to ascertain that expression from the synthetic gene is consistently increased at different expression levels. The results show a 70-600 fold increase of expression for the synthetic Renilla luciferase gene in these cells (Table 10).

TABLE 10 Enhanced Synthetic Renilla Gene Expression
Cell Type    Amount Vector    Fold Expression Increase
CHO          0.2 μg           142
             2.8 μg           145
NIH3T3       0.2 μg           326
             2.0 μg           593
HeLa         0.2 μg           185
             1.0 μg           103
CV-1         0.2 μg           68
             2.0 μg           72

One important advantage of a luciferase reporter is its short protein half-life. The enhanced expression could also result from an extended protein half-life and, if so, this would be a disadvantage of the new gene. This possibility is ruled out by a cycloheximide chase (“CHX Chase”) experiment (FIG. 14), which demonstrated that there was no increase of protein half-life resulting from the humanized Renilla luciferase gene.

To ensure that the increase in expression is not limited to one expression vector backbone and is not promoter specific and/or cell specific, a synthetic Renilla gene (Rluc-final) as well as the native Renilla gene were cloned into different vector backbones and under different promoters (FIG. 13B). The synthetic gene always exhibited increased expression compared to its wild-type counterpart (Table 11).

TABLE 11 Renilla Gene Expression: native v. synthetic (Rluc-final)
Vector                          NIH-3T3       HeLa          CHO
pRL-tk, native                  3,834.6       922.4         7,671.9
pRL-tk, synthetic               13,252.5      9,040.2       41,743.5
pRL-CMV, native                 168,062.2     842,482.5     153,539.5
pRL-CMV, synthetic              2,168,129     8,440,306     2,532,576
pRL-SV40, native                224,224.4     346,787.6     85,323.6
pRL-SV40, synthetic             1,469,588     2,632,510     1,422,830
pRL-null, native                2,853.8       431.7         2,434
pRL-null, synthetic             9,151.17      2,439         28,317.1
pRGL3b, native                  12            21.8          17
pRGL3b, synthetic               130.5         212.4         1,094.5
pRGL3-tk, native                27.9          155.5         186.4
pRGL3-tk, synthetic             6,778.2       8,782.5       9,685.9
pRL-tk no intron, native        31.8          165           93.4
pRL-tk no intron, synthetic     6,665.5       6,379         21,433.1
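The fold increase implied by Table 11 is simply the synthetic value divided by the native value for the same vector and cell line. The following Python sketch computes it for three of the vectors. The data structure and the function name `fold_increase` are illustrative choices, and the remaining rows can be added in the same way.

```python
CELLS = ("NIH-3T3", "HeLa", "CHO")

# Selected rows of Table 11 (expression values as reported).
TABLE_11 = {
    "pRL-tk": {
        "native": (3834.6, 922.4, 7671.9),
        "synthetic": (13252.5, 9040.2, 41743.5),
    },
    "pRL-CMV": {
        "native": (168062.2, 842482.5, 153539.5),
        "synthetic": (2168129.0, 8440306.0, 2532576.0),
    },
    "pRGL3-tk": {
        "native": (27.9, 155.5, 186.4),
        "synthetic": (6778.2, 8782.5, 9685.9),
    },
}


def fold_increase(vector):
    """Return synthetic/native expression ratios keyed by cell line."""
    row = TABLE_11[vector]
    return {
        cell: synthetic / native
        for cell, native, synthetic in zip(CELLS, row["native"], row["synthetic"])
    }


if __name__ == "__main__":
    for vector in TABLE_11:
        ratios = fold_increase(vector)
        summary = ", ".join(f"{cell}: {ratio:.1f}x" for cell, ratio in ratios.items())
        print(f"{vector}: {summary}")
```

Every ratio computed this way is greater than 1, consistent with the statement that the synthetic gene always exhibited increased expression.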

TABLE 12 Renilla Luciferase Expression in Mammalian Cells
Percent of control vector
Vector                      CHO cells    NIH3T3 cells    HeLa cells
pRL-control native          100          100             100
pRL-control synthetic       100          100             100
pRL-basic native            4.1          5.6             0.2
pRL-basic synthetic         0.4          0.1             0.0
pRL-promoter native         5.9          7.8             0.6
pRL-promoter synthetic      15.0         9.9             1.1
pRL-enhancer native         42.1         123.9           52.7
pRL-enhancer synthetic      2.6          1.5             5.4
(Vector Backbones Illustrated in FIG. 13A)

With reduced spurious expression, the synthetic gene should exhibit less basal level transcription in a promoterless vector. The synthetic and native Renilla luciferase genes were cloned into the pGL3-basic vector to compare the basal level of transcription. Because the synthetic gene itself has increased expression efficiency, the activity from the promoterless vector cannot be compared directly to judge the difference in basal transcription. Rather, this is taken into consideration by comparing the percentage of activity from the promoterless vector in reference to the control vector (expression from the basic vector divided by the expression in the fully functional expression vector with both promoter and enhancer elements). The data demonstrate that the synthetic Renilla luciferase has a lower level of basal transcription than the native gene (Table 12).

It is well known to those skilled in the art that an enhancer can substantially stimulate promoter activity. To test whether the synthetic gene has reduced risk of inappropriate transcriptional characteristics, the native and synthetic genes were introduced into a vector with an enhancer element (pGL3-enhancer vector). Because the synthetic gene has higher expression efficiency, the activity of both cannot be compared directly to compare the level of transcription in the presence of the enhancer. This is taken into account by using the percentage of activity from the enhancer vector in reference to the control vector (expression in the presence of enhancer divided by the expression in the fully functional expression vector with both promoter and enhancer elements). Such results show that when the native gene is present, the enhancer alone is able to stimulate transcription to 42-124% of the control. However, when the native gene is replaced by the synthetic gene in the same vector, the activity only constitutes 1-5% of the value obtained when the same enhancer and a strong SV40 promoter are employed. This clearly demonstrates that the synthetic gene has reduced risk of spurious expression (Table 12).
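The normalization used in Table 12 and the resulting comparison between the native and synthetic genes can be expressed as follows. This Python sketch is illustrative: the function names are assumptions, only the pRL-basic and pRL-enhancer rows of Table 12 are used, and a ratio is reported as `None` where the synthetic value is listed as 0.0 and no ratio can be formed.

```python
CELLS = ("CHO", "NIH3T3", "HeLa")

# Percent-of-control values taken from Table 12.
TABLE_12 = {
    "basic": {"native": (4.1, 5.6, 0.2), "synthetic": (0.4, 0.1, 0.0)},
    "enhancer": {"native": (42.1, 123.9, 52.7), "synthetic": (2.6, 1.5, 5.4)},
}


def percent_of_control(test_rlu, control_rlu):
    """Activity of a test vector as a percentage of the control vector."""
    return 100.0 * test_rlu / control_rlu


def relative_reduction(vector_type):
    """Native/synthetic ratio of percent-of-control activity per cell line."""
    row = TABLE_12[vector_type]
    return {
        cell: (native / synthetic if synthetic > 0 else None)
        for cell, native, synthetic in zip(CELLS, row["native"], row["synthetic"])
    }


if __name__ == "__main__":
    for vector_type in TABLE_12:
        print(vector_type, relative_reduction(vector_type))
```

For the enhancer-only vector, for example, the relative activity of the native gene in CHO cells is roughly 16 times that of the synthetic gene (42.1 versus 2.6).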

The synthetic Renilla gene (Rluc-final) was used in in vitro systems to compare translation efficiency with the native gene. In a T7 quick coupled transcription/translation system (Promega Corp., Madison, Wis.), pRL-null native plasmid (having the native Renilla luciferase gene under the control of the T7 promoter) or the same amount of pRL-null-synthetic plasmid (having the synthetic Renilla luciferase gene under the control of the T7 promoter) was added to the TNT reaction mixture and luciferase activity measured every 5 minutes up to 60 minutes. Dual Luciferase assay kit (Promega Corp.) was used to measure Renilla luciferase activity. The data showed that improved expression was obtained from the synthetic gene (FIGS. 15A, B). To provide further evidence of the increased translation efficiency of the synthetic gene, RNA was prepared by an in vitro transcription system, then purified. pRL-null (native or synthetic) vectors were linearized with BamH I. The DNA was purified by multiple phenol-chloroform extractions followed by ethanol precipitation. An in vitro T7 transcription system was employed to prepare RNAs. The DNA template was removed by using RNase-free DNase, and RNA was purified by phenol-chloroform extraction followed by multiple isopropanol precipitations. The same amount of purified RNA, either for the synthetic gene or the native gene, was then added to a rabbit reticulocyte lysate (FIGS. 15C, D) or wheat germ lysate (FIGS. 15E, F). Again, the synthetic Renilla luciferase gene RNA produced more luciferase than the native one. These data suggest that the translation efficiency is improved by the synthetic sequence. To determine why the synthetic gene was highly expressed in wheat germ, plant codon usage was determined. The lowest usage codons in higher plants coincided with those in mammals.

Reporter gene assays are widely used to study transcriptional regulation events. This is often carried out in co-transfection experiments, in which, along with the primary reporter construct containing the testing promoter, a second control reporter under a constitutive promoter is transfected into cells as an internal control to normalize experimental variations including transfection efficiencies between the samples. Control reporter signal, potential promoter cross talk between the control reporter and primary reporter, as well as potential regulation of the control reporter by experimental conditions, are important aspects to consider for selecting a reliable co-reporter vector.

As described above, vector constructs were made by cloning the synthetic Renilla luciferase gene into different vector backbones under different promoters. All the constructs showed higher expression in the three mammalian cell lines tested (Table 11). Thus, with better expression efficiency, the synthetic Renilla luciferase gives a higher signal when transfected into mammalian cells.

Because a higher signal is obtained, less promoter activity is required to achieve the same reporter signal, which reduces the risk of promoter interference. CHO cells were transfected with 50 ng pGL3-control (firefly luc+) plus one of 5 different amounts of native pRL-TK plasmid (50, 100, 500, 1000, or 2000 ng) or synthetic pRL-TK (5, 10, 50, 100, or 200 ng). To each transfection, pUC19 carrier DNA was added to a total of 3 μg DNA. FIG. 16 shows the experiment demonstrating that 10-fold less synthetic pRL-TK DNA gives a signal similar to or greater than that of the native gene, with reduced risk of inhibiting expression from the primary reporter pGL3-control.

Experimental treatment sometimes may activate cryptic sites within the gene and cause induction or suppression of the co-reporter expression, which would compromise its function as co-reporter for normalization of transfection efficiencies. One example is that TPA induces expression of co-reporter vectors harboring the wild-type gene when transfecting MCF-7 cells. 500 ng pRL-TK (native), 5 μg native and synthetic pRG-B, and 2.5 μg native and synthetic pRG-TK were transfected per well of MCF-7 cells. 100 ng/well pGL3-control (firefly luc+) was co-transfected with all RL plasmids. Carrier DNA, pUC19, was used to bring the total DNA transfected to 5.1 μg/well. 15.3 μl TransFast Transfection Reagent (Promega Corp., Madison, Wis.) was added per well. Sixteen hours later, cells were trypsinized, pooled and split into six wells of a 6-well dish and allowed to attach to the well for 8 hours. Three wells were then treated with 0.2 nM of the tumor promoter TPA (phorbol-12-myristate-13-acetate, Calbiochem #524400-S), and three wells were mock treated with 20 μl DMSO. Cells were harvested with 0.4 ml Passive Lysis Buffer 24 hours post TPA addition. The results showed that by using the synthetic gene, undesirable change of co-reporter expression by experimental stimuli can be avoided (Table 13). This demonstrates that using the synthetic gene can reduce the risk of anomalous expression.

TABLE 13 TPA Induction
Vector                            Rlu        Fold Induction
pRL-tk untreated (native)         184
pRL-tk TPA treated (native)       812        4.4
pRG-B untreated (native)          1
pRG-B TPA treated (native)        8          8.0
pRG-B untreated (final)           132
pRG-B TPA treated (final)         195        1.47
pRG-tk untreated (native)         44
pRG-tk TPA treated (native)       192        4.36
pRG-tk untreated (final)          12,816
pRG-tk TPA treated (final)        11,347     0.88

REFERENCES

-   Altschul et al., Nucl. Acids Res., 25, 3389 (1997).
-   Aota et al., Nucl. Acids Res., 16, 315 (1988).
-   Boshart et al., Cell, 41, 521 (1985).
-   Bronstein et al., Anal. Biochem., 219, 169 (1994).
-   Corpet et al., Nucl. Acids Res., 16, 881 (1988).
-   deWet et al., Mol. Cell. Biol., 7, 725 (1987).
-   Dijkema et al., EMBO J., 4, 761 (1985).
-   Faist and Meyer, Nucl. Acids Res., 20, 26 (1992).
-   Gorman et al., Proc. Natl. Acad. Sci. USA, 79, 6777 (1982).
-   Higgins et al., Gene, 73, 237 (1985).
-   Higgins et al., CABIOS, 5, 151 (1989).
-   Huang et al., CABIOS, 8, 155 (1992).
-   Itolcik et al., PNAS, 94, 12410 (1997).
-   Johnson et al., Mol. Reprod. Devel., 50, 377 (1998).
-   Jones et al., Mol. Cell. Biol., 17, 6970 (1997).
-   Karlin and Altschul, Proc. Natl. Acad. Sci. USA, 87, 2264 (1990).
-   Karlin and Altschul, Proc. Natl. Acad. Sci. USA, 90, 5873 (1993).
-   Keller et al., J. Cell Biol., 84, 3264 (1987).
-   Kim et al., Gene, 91, 217 (1990).
-   Lamb et al., Mol. Reprod. Devel., 51, 218 (1998).
-   Maniatis et al., Science, 236, 1237 (1987).
-   Michael et al., EMBO J., 9, 481 (1990).
-   Mizushima and Nagata, Nucl. Acids Res., 18, 5322 (1990).
-   Murray et al., Nucl. Acids Res., 17, 477 (1989).
-   Myers and Miller, CABIOS, 4, 11 (1988).
-   Needleman and Wunsch, J. Mol. Biol., 48, 443 (1970).
-   Pearson and Lipman, Proc. Natl. Acad. Sci. USA, 85, 2444 (1988).
-   Pearson et al., Meth. Mol. Biol., 24, 307 (1994).
-   Sharp et al., Nucl. Acids Res., 16, 8207 (1988).
-   Sharp et al., Nucl. Acids Res., 15, 1281 (1987).
-   Smith and Waterman, Adv. Appl. Math., 2, 482 (1981).
-   Stemmer et al., Gene, 164, 49 (1995).
-   Uetsuki et al., J. Biol. Chem., 264, 5791 (1989).
-   Voss et al., Trends Biochem. Sci., 11, 287 (1986).
-   Wada et al., Nucl. Acids Res., 18, 2367 (1990).
-   Watson et al., eds., Recombinant DNA: A Short Course, Scientific American Books, W. H. Freeman and Company, New York (1983).
-   Wood, K., Photochemistry and Photobiology, 62, 662 (1995).
-   Wood, K., Science, 244, 700 (1989).

All publications, patents and patent applications are incorporated herein by reference. While in the foregoing specification, this invention has been described in relation to certain preferred embodiments thereof, and many details have been set forth for purposes of illustration, it will be apparent to those skilled in the art that the invention is susceptible to additional embodiments and that certain of the details herein may be varied considerably without departing from the basic principles of the invention.

1-47. (canceled)
48. A method to prepare a synthetic nucleic acid molecule comprising an open reading frame, comprising: a) altering a plurality of transcription regulatory sequences in a parent nucleic acid sequence which encodes a polypeptide having at least 100 amino acids to yield a synthetic nucleic acid molecule which has at least 3-fold fewer transcription regulatory sequences relative to the parent nucleic acid sequence, wherein the transcription regulatory sequences are selected from the group consisting of transcription factor binding sequences, intron splice sites, poly(A) addition sites, enhancer sequences and promoter sequences; and b) altering greater than 25% of the codons in the synthetic nucleic acid sequence which has a decreased number of transcription regulatory sequences to yield a further synthetic nucleic acid molecule, wherein the codons which are altered do not result in an increased number of transcription regulatory sequences, wherein the further synthetic nucleic acid molecule encodes a polypeptide with at least 85% amino acid sequence identity to the polypeptide encoded by the parent nucleic acid sequence.
49. A method to prepare a synthetic nucleic acid molecule comprising an open reading frame, comprising: a) altering greater than 25% of the codons in a parent nucleic acid sequence which encodes a polypeptide having at least 100 amino acids to yield a codon-altered synthetic nucleic acid molecule, and b) altering a plurality of transcription regulatory sequences in the codon-altered synthetic nucleic acid molecule to yield a further synthetic nucleic acid molecule which has at least 3-fold fewer transcription regulatory sequences relative to a synthetic nucleic acid molecule with a random selection of codons at the codons which differ, wherein the transcription regulatory sequences are selected from the group consisting of transcription factor binding sequences, intron splice sites, poly(A) addition sites, enhancer sequences and promoter sequences, and wherein the further synthetic nucleic acid molecule encodes a polypeptide with at least 85% amino acid sequence identity to the polypeptide encoded by the parent nucleic acid sequence.
50. The method of claim 48 or 49 wherein the parent nucleic acid sequence encodes a reporter molecule.
51. The method of claim 48 or 49 wherein the parent nucleic acid sequence encodes a luciferase.
52. The method of claim 48 or 49 wherein the synthetic nucleic acid molecule hybridizes under medium stringency hybridization conditions to the parent nucleic acid sequence.
53. The method of claim 48 or 49 wherein the codons which are altered encode the same amino acid as the corresponding codons in the parent nucleic acid sequence.
54. (canceled)
55. A method for preparing at least two synthetic nucleic acid molecules which are codon distinct versions of a parent nucleic acid sequence which encodes a polypeptide, comprising: a) altering a parent nucleic acid sequence to yield a synthetic nucleic acid molecule having an increased number of a first plurality of codons that are employed more frequently in a selected host cell relative to the number of those codons in the parent nucleic acid sequence; and b) altering the parent nucleic acid sequence to yield a further synthetic nucleic acid molecule having an increased number of a second plurality of codons that are employed more frequently in the host cell relative to the number of those codons in the parent nucleic acid sequence, wherein the first plurality of codons is different than the second plurality of codons, and wherein the synthetic and the further synthetic nucleic acid molecules encode the same polypeptide.
56. The method of claim 55 further comprising altering a plurality of transcription regulatory sequences in the synthetic nucleic acid molecule, the further synthetic nucleic acid molecule, or both, to yield at least one yet further synthetic nucleic acid molecule which has at least 3-fold fewer transcription regulatory sequences relative to the synthetic nucleic acid molecule, the further synthetic nucleic acid molecule, or both.
57. The method of claim 55 further comprising altering at least one codon in the first synthetic sequence to yield a first modified synthetic sequence which encodes a polypeptide with at least one amino acid substitution relative to the polypeptide encoded by the first synthetic nucleic acid sequence.
58. The method of claim 56 further comprising altering at least one codon in the second synthetic sequence to yield a second modified synthetic sequence which encodes a polypeptide with at least one amino acid substitution relative to the polypeptide encoded by the first synthetic nucleic acid sequence.
59. The method of claim 55 wherein the synthetic sequences encode a luciferase.
60-64. (canceled)
65. The method of claim 48 or 49 further comprising altering the further synthetic nucleic acid molecule to encode a polypeptide having at least one amino acid substitution relative to the polypeptide encoded by the parent nucleic acid sequence.
66. The method of claim 48 or 49 wherein the altering of transcription regulatory sequences does not introduce amino acid substitutions to the polypeptide encoded by the synthetic nucleic acid molecule.